Secure and Reliable Network Applications
Kenneth P. Birman
Department of Computer Science
Ithaca, New York 14853
Cover image: line drawing of the golden gate bridge looking towards San Francisco?
@ Copyright 1995, Kenneth P. Birman. All rights reserved. This document may not be copied, electronically or physically, in whole or in part, or otherwise disseminated without the author’s prior written permission.
Trademarks Cited in the Text
Preface and Acknowledgements
A User’s Guide to This Book
Part I: Basic Distributed Computing Technologies
1.2 Components of a Reliable Distributed Computing System
1.2.1 Communications Technology
1.2.2 Basic transport and network services
1.2.3 Reliable transport software and communication support
1.2.4 "Middleware": Software tools, utilities, and programming languages
1.2.5 Distributed computing environments
1.2.6 End-user applications
1.3 Critical Dependencies
1.4 Next Steps
1.5 Additional Reading
2. Communication Technologies
2.1 Types of Communication Devices
2.5 B-ISDN and the Intelligent Network
2.7 Cluster and Parallel Architectures
2.8 Next steps
2.9 Additional Reading
3. Basic Communication Services
3.1 Communications Standards
3.3 Internet Protocols
3.3.1 Internet Protocol: IP layer
3.3.2 Transport Control Protocol: TCP
3.3.3 User Datagram Protocol: UDP
3.3.4 Internet Packet Multicast Protocol: IP Multicast
3.5 End-to-end Argument
3.6 O/S Architecture Issues, Buffering, Fragmentation
3.7 Xpress Transfer Protocol
3.8 Next Steps
3.9 Additional Reading
4. RPC and the Client-Server Model
4.1 RPC Protocols and Concepts
4.2 Writing an RPC-based Client or Server Program
4.3 The RPC Binding Problem
4.4 Marshalling and Data Types
4.5 Associated Services
4.5.1 Naming services
4.5.2 Time services
4.5.3 Security services
4.5.4 Threads packages
4.6 The RPC Protocol
4.7 Using RPC in Reliable Distributed Systems
4.8 Related Readings
5.1 Sliding Window Protocols
5.1.1 Error Correction
5.1.2 Flow Control
5.1.3 Dynamic Adjustment of Window Size
5.1.4 Burst Transmission Concept
5.2 Negative-Acknowledgement Only
5.3 Reliability, Fault-tolerance, and Consistency in Streams
5.4 RPC over a Stream
5.5 Related Readings
6. CORBA and Object-Oriented Environments
6.1 The ANSA Project
6.2 Beyond ANSA to CORBA
6.3 OLE-2 and Network OLE
6.4 The CORBA Reference Model
6.6 IDL and ODL
6.8 Naming Service
6.10 Life Cycle Service
6.11 Persistent Object Service
6.12 Transaction Service
6.13 Inter-Object Broker Protocol
6.14 Future CORBA Services
6.15 Properties of CORBA Solutions
6.16 Related Readings
7. Client-Server Computing
7.1 Stateless and Stateful Client-Server Interactions
7.2 Major Uses of the Client-Server Paradigm
7.3 Distributed File Systems
7.4 Stateful File Servers
7.5 Distributed Database Systems
7.6 Applying Transactions to File Servers
7.7 Message Oriented Middleware
7.8 Related Topics
7.9 Related Readings
8. Operating System Support for High Performance Communication
8.1 Lightweight RPC
8.2 Fbuf’s and the xKernel Project
8.3 Active Messages
8.4 Beyond Active Messages: U-Net
8.5 Protocol Compilation Techniques
8.6 Related Readings
Part II: The World Wide Web
9. The World Wide Web
9.1 Related Readings
10. The Major Web Technologies
10.1 Hyper-Text Markup Language (HTML)
10.2 Virtual Reality Markup Language (VRML)
10.3 Universal Resource Locators (URLs)
10.4 Hyper-Text Transport Protocol (HTTP)
10.5 Representations of Image Data
10.6 Authorization and Privacy Issues
10.7 Web Proxy Servers
10.8 Java, HotJava, and Agent Based Browsers
10.9 GUI Builders and Other Distributed CASE Tools
10.10 Tacoma and the Agent Push Model
10.11 Web Search Engines and Web Crawlers
10.12 Important Web Servers
10.13 Future Challenges
10.14 Related Readings
11. Related Internet Technologies
11.1 File Transfer Tools
11.2 Electronic Mail
11.3 Network Bulletin Boards (newsgroups)
11.4 Message Oriented MiddleWare Systems (MOMS)
11.5 Message Bus Architectures
11.6 Internet Firewalls and Gateways
11.7 Related Readings
Part III: Reliable Distributed Computing
12. How and Why Computer Systems Fail
12.1 Hardware Reliability and Trends
12.2 Software Reliability and Trends
12.3 Other Sources of Downtime
12.5 Detecting failures
12.6 Hostile Environments
12.7 Related Readings
13. Guaranteeing Behavior in Distributed Systems
13.1 Consistent Distributed Behavior
13.2 Warning: Rough Road Ahead!
13.3 Membership in a Distributed System
13.4 Time in Distributed Systems
13.5 Failure Models and Reliability Goals
13.6 Reliable Computing in a Static Membership Model
13.6.1 The Distributed Commit Problem
126.96.36.199 Two-Phase Commit
188.8.131.52 Three-Phase Commit
13.6.2 Reading and Updating Replicated Data with Crash Failures
13.7 Replicated Data with Non-Benign Failure Modes
13.8 Reliability in Asynchronous Environments
13.9 The Dynamic Group Membership Problem
13.10 The Group Membership Problem
13.10.1 Protocol used to track GMS Membership
13.10.2 GMS Protocol to Handle Client Add and Join Events
13.10.3 GMS Notifications With Bounded Delay
13.10.4 Extending the GMS to Allow Partition and Merge Events
13.11 Dynamic Process Groups and Group Communication
13.11.1 Group Communication Primitives
13.12 Delivery Ordering Options
184.108.40.206 Non-Uniform Failure-Atomic Group Multicast
220.127.116.11 Dynamically Uniform Failure-Atomic Group Multicast
13.12.2 Dynamic Process Groups
13.12.3 View-Synchronous Failure Atomicity
13.12.4 Summary of GMS Properties
13.12.5 Ordered Multicast
18.104.22.168 Fifo Order
22.214.171.124 Causal Order
126.96.36.199.1 Causal ordering with logical timestamps
188.8.131.52.2 Causal ordering with vector timestamps
184.108.40.206.3 Timestamp compression
220.127.116.11.4 Causal multicast and consistent cuts
18.104.22.168.5 Exploiting Topological Knowledge
22.214.171.124 Total Order
13.13 Communication From Non-Members to a Group
13.14 Communication from a Group to a Non-Member
13.16 Related Readings
14. Point-to-Point and Multigroup Considerations
14.1 Causal Communication Outside of a Process Group
14.2 Extending Causal Order to Multigroup Settings
14.3 Extending Total Order to Multigroup Settings
14.4 Causal and Total Ordering Domains
14.5 Multicasts to Multiple Groups
14.6 Multigroup View Management Protocols
14.7 Related Reading
15. The Virtually Synchronous Execution Model
15.1 Virtual Synchrony
15.2 Extended Virtual Synchrony
15.3 Virtually Synchronous Algorithms and Tools
15.3.1 Replicated Data and Synchronization
15.3.2 State transfer to a joining process
15.3.4 Primary-Backup Fault Tolerance
15.3.5 Coordinator-Cohort Fault-Tolerance
15.4 Related Readings
16. Consistency in Distributed Systems
16.1 Consistency in the Static and Dynamic Membership Models
16.2 General remarks Concerning Causal and Total Ordering
16.3 Summary and Conclusion
16.4 Related Reading
17. Retrofitting Reliability into Complex Systems
17.1 Wrappers and Toolkits
17.1.1 Wrapper Technologies
126.96.36.199 Wrapping at Object Interfaces
188.8.131.52 Wrapping by Library Replacement
184.108.40.206 Wrapping by Object Code Editing
220.127.116.11 Wrapping With Interposition Agents and Buddy Processes
18.104.22.168 Wrapping Communication Infrastructures: Virtual Private Networks
22.214.171.124 Wrappers: Some Final Thoughts
17.1.2 Introducing Robustness in Wrapped Applications
17.1.3 Toolkit Technologies
17.1.4 Distributed Programming Languages
17.2 Wrapping a Simple RPC server
17.3 Wrapping a Web Server
17.4 Hardening Other Aspects of the Web
17.5 Unbreakable Stream Connections
17.5.1 Reliability Options for Stream Communication
17.5.2 An Unbreakable Stream That Mimics TCP
17.5.3 Non-Determinism and Its Consequences
17.5.4 Dealing With Arbitrary Non-Determinism
17.5.5 Replicating the IP Address
17.5.6 Maximizing Concurrency by Relaxing Multicast Ordering
17.5.7 State Transfer Issues
17.6 Building a Replicated TCP Protocol Using a Toolkit
17.7 Reliable Distributed Shared Memory
17.7.1 The shared memory wrapper abstraction
17.7.2 Memory coherency options for distributed shared memory
17.7.3 False sharing
17.7.4 Demand paging and intelligent prefetching
17.7.5 Fault-tolerance issues
17.7.6 Security and protection considerations
17.7.7 Summary and discussion
17.8 Related Readings
18. Reliable Distributed Computing Systems
18.1 Architectural Considerations in Reliable Systems
18.2 Horus: A Flexible Group Communications System
18.2.1 A layered process group architecture
18.3 Protocol stacks
18.4 Using Horus to Build a Robust Groupware Application
18.5 Using Horus to Harden CORBA applications
18.6 Basic Performance of Horus
18.7 Masking the Overhead of Protocol Layering
18.7.1 Reducing Header Overhead
18.7.2 Eliminating Layered Protocol Processing Overhead
18.7.3 Message Packing
18.7.4 Performance of Horus with the Protocol Accelerator
18.9 Related Readings
19. Security Options for Distributed Settings
19.1 Perimeter Defense Technologies
19.2 Access Control Technologies
19.3 Authentication Schemes and Kerberos
19.3.1 RSA and DES
19.3.3 ONC security and NFS
19.4 Availability and Security
19.5 Related Readings
20. Clock Synchronization and Synchronous Systems
20.1 Clock Synchronization
20.2 Timed-asynchronous Protocols
20.3 Adapting Virtual Synchrony for Real-Time Settings
20.4 Related Readings
21. Transactional Systems
21.1 Implementation of a Transactional Storage System
21.1.1 Write-ahead logging
21.1.2 Persistent data seen "through" an updates list
21.1.3 Non-distributed commit actions
21.2 Distributed Transactions and Multi-Phase Commit
21.3 Transactions on Replicated Data
21.4 Nested Transactions
21.4.1 Comments on the nested transaction model
21.5 Weak Consistency Models
21.5.1 Epsilon serializability
21.5.2 Weak and strong consistency in partitioned database systems
21.5.3 Transactions on multi-database systems
21.5.5 Transactions in Real-Time Systems
21.6 Advanced Replication Techniques
21.7 Related Readings
22. Probabilistic Protocols
22.1 Probabilistic Protocols
22.2 Other applications of gossip protocols
22.3 Hayden’s pbcast primitive
22.3.1 Unordered pbcast protocol
22.3.2 Adding Total Ordering
22.3.3 Probabilistic Reliability and the Bimodal Delivery Distribution
22.3.4 An Extension to Pbcast
22.3.5 Evaluation and Scalability
126.96.36.199 Message cost and fanout.
22.4 An Unscalable System Model
22.5 Replicated Data using Pbcast
22.5.1 Representation of replicated data
22.5.2 Update protocol
22.5.3 Read protocol
22.5.4 Locking protocol
22.6 Related Readings
23. Distributed System Management
23.1 A Relational System Model
23.2 Instrumentation Issues: Sensors, Actuators
23.3 Management Information Bases, SNMP and CMIP
23.3.1 Sensors and events
23.4 Reactive control in Distributed Settings
23.5 Fault-tolerance by State Machine Replication
23.6 Visualization of Distributed System States
23.7 Correlated Events
23.8 Information Warfare and Defensive Tactics
23.9 Related Readings
24. Cluster Computer Architectures
24.1 Inside a High Availability Cluster Product: The Stratus Radio
24.2 Reliability Goals for Cluster Servers
24.3 Comparison with Fault-Tolerant Hardware
24.4 Protocol Optimizations
24.5 Cluster API Goals and Implementation
24.6 Related Readings
25. Reasoning About Distributed Systems
25.1 Dimensions of the Systems Validation Problem
25.2 Process and Message-Oriented Models
25.3 System Definition Languages
25.4 High Level Languages and Logics
26. Other Distributed and Transactional Systems
26.1 Related Work in Distributed Computing
26.1.5 The Highly Available System (HAS)
26.1.6 The Isis Toolkit
26.1.8 Sender-Based Logging and Manetho
26.1.18 The V System
26.2 Systems That Implement Transactions
26.2.5 Camelot and Encina
Trademarks Cited in the Text
Unix is a Trademark of Santa Cruz Operations, Inc. CORBA (Common Object Request Broker Architecture) and OMG IDL are trademarks of the Object Management Group. ONC (Open Network Computing), NFS (Network File System), Solaris, Solaris MC, XDR (External Data Representation), and Java are trademarks of Sun Microsystems Inc. DCE is a trademark of the Open Software Foundation. XTP (Xpress Transfer Protocol) is a trademark of the XTP Forum. RADIO is a trademark of Stratus Computer Corporation. Isis Reliable Software Developer’s Kit, Isis Reliable Network File System, Isis Reliable Message Bus and Isis for Databases are trademarks of Isis Distributed Computing Systems, Inc. Orbix is a trademark of Iona Technologies Ltd. Orbix+Isis is a joint trademark of Iona and Isis Distributed Computing Systems, Inc. TIB (Teknekron Information Bus) and Subject Based Addressing are trademarks of Teknekron Software Systems (although we use "subject based addressing" in a more general sense in this text). Chorus is a trademark of Chorus Systemes Inc. Power Objects is a trademark of Oracle Corporation. Netscape is a trademark of Netscape Communications. OLE, Windows, Windows New Technology (Windows NT), and Windows 95 are trademarks of Microsoft Corporation. Lotus Notes is a trademark of Lotus Computing Corporation. Purify is a trademark of Highland Software, Inc. Proliant is a trademark of Compaq Computers Inc. VAXClusters, DEC MessageQ, and DECsafe Available Server Environment are trademarks of Digital Equipment Corporation. MQSeries and SP2 are trademarks of International Business Machines. Power Builder is a trademark of PowerSoft Corporation. Visual Basic is a trademark of Microsoft Corporation. Ethernet is a trademark of Xerox Corporation.
Other products and services mentioned in this document are covered by the trademarks, service marks, or product names as designated by the companies that market those products. The author respectfully acknowledges any such that may not have been included above.
Preface and Acknowledgements
This book is dedicated to my family, for their support and tolerance over the two-year period that it was written. The author is grateful to so many individuals, for their technical assistance with aspects of the development, that to try and list them one by one would certainly be to omit someone whose role was vital. Instead, let me just thank my colleagues at Cornell, Isis Distributed Systems, and worldwide for their help in this undertaking. I am also greatful to Paul Jones of Isis Distributed Systems and to Francois Barrault and Yves Eychenne of Stratus France and Isis Distributed Systems, France, for providing me with resources needed to work on this book during a sabbatical that I spent in Paris, in fall of 1995 and spring of 1996. Cindy Williams and Werner Vogels provided invaluable help in overcoming some of the details of working at such a distance from home.
A number of reviewers provided feedback on early copies of this text, leading to (one hopes) considerable improvement in the presentation. Thanks are due to: Marjan Bace, David Bakken, Robert Cooper, Yves Eychenne, Dalia Malki, Raghu Hudli, David Page, David Plainfosse, Henrijk Paszt, John Warne and Werner Vogels. Raj Alur, Ian Service and Mark Wood provided help in clarifying some thorny technical questions, and are also gratefully acknowledged. Bruce Donald’s emails on idiosyncracies of the Web were extremely useful and had a surprisingly large impact on treatment of that topic in this text.
Much of the work reported here was made possible by grants from the U.S. Department of Defense through its Advanced Research Projects Agency, DARPA (administered by the Office of Naval Research, Rome Laboratories, and NASA), and by infrastructure grants from the National Science Foundation. Grants from a number of corporations have also supported this work, including IBM Corporation, Isis Distributed Systems Inc., Siemens Corporate Research (Munich and New Jersey), and GTE Corporation. I wish to express my thanks to all of these agencies and corporations for their generosity.
The techniques, approaches, and opinions expressed here are my own, and may not represent positions of the organizations and corporations that have supported this research.
Despite nearly twenty years of progress towards ubiquitous computer connectivity, distributed computing systems have only recently emerged to play a serious role in industry and society. Perhaps this explains why so few distributed systems are reliable in the sense of tolerating failures automatically, guaranteeing properties such as performance or response time, or offering security against intentional threats. In many ways the engineering discipline of reliable distributed computing is still in its infancy.
One might be tempted to reason tautologically, concluding that reliability must not be all that important in distributed systems (since otherwise, the pressure to make such systems reliable would long since have become overwhelming). Yet, it seems more likely that we have only recently begun to see the sorts of distributed computing systems in which reliability is critical. To the extent that existing mission- and even life-critical applications rely upon distributed software, the importance of reliability has perhaps been viewed as a narrow, domain-specific issue. On the other hand, as distributed software is placed into more and more critical applications, where safety or financial stability of large organizations depends upon the reliable operation of complex distributed applications, the inevitable result will be growing demand for technology developers to demonstrate the reliability of their distributed architectures and solutions. It is time to tackle distributed systems reliability in a serious way. To fail to do so today is to invite catastrophic computer-systems failures tomorrow.
At the time of this writing, the sudden emergence of the "World Wide Web" (variously called the "Web", the Information Superhighway, the Global Information Infrastructure, the Internet, or just the Net) is bringing this issue to the forefront. In many respects, the story of reliability in distributed systems is today tied to the future of the Web and the technology base that has been used to develop it. It is unlikely that any reader of this text is unfamiliar with the Web technology base, which has penetrated the computing industry in record time. A basic premise of our study is that the Web will be a driver for distributed computing, by creating a mass market around distributed computing. However, the term "Web" is often used loosely: much of the public sees the Web as a single entity that encompasses all the Internet technologies that exist today and that may be introduced in the future. Thus when we talk about the Web, we are inevitably faced with a much broader family of communications technologies.
It is clear that some form of critical mass has recently been reached: distributed computing is emerging from its specialized and very limited niche to become a mass-market commodity, something that literally everyone depends upon, like a telephone or an automobile. The Web paradigm brings together the key attributes of this new market in a single package: easily understandable graphical displays, substantial content, unlimited information to draw upon, virtual worlds in which to wander and work. But the Web is also stimulating growth in other types of distributed applications. In some intangible way, the experience of the Web has caused modern society to suddenly notice the potential of distributed computing.
Consider the implications of a societal transition whereby distributed computing has suddenly become a mass market commodity. In the past, a mass-market item was something everyone "owned". With the Web, one suddenly sees a type of commodity that everyone "does". For the most part, the computers and networks were already in place. What has changed is the way that people see them and use them. The paradigm of the Web is to connect useful things (and many useless things) to the network. Communication and connectivity suddenly seem to be mandatory: no company can possibily risk arriving late for the Information Revolution. Increasingly, it makes sense to believe that if an application can be put on the network, someone is thinking about doing so, and soon.
Whereas reliability and indeed distributed computing were slow to emerge prior to the introduction of the Web, reliable distributed computing will be necessary if networked solutions are to be used safely for many of the applications that are envisioned. In the past, researchers in the field wondered why the uptake of distributed computing had been so slow. Overnight, the question has become one of understanding how the types of computing systems that run on the Internet and the Web, or that will be accessed through it, can be made reliable enough for emerging critical uses.
If Web-like interfaces present medical status information and records to a doctor in a hospital, or are used to control a power plant from a remote console, or to guide the decision making of major corporations, reliability of those interfaces and applications will be absolutely critical to the users. Some may have life-or-death implications: if that physician bases a split-second decision on invalid data, the patient might die. Others may be critical to the efficient function of the organization that uses them: if a bank mismanages risk because of an inaccurate picture of how its investments are allocated, the bank could incur huge losses or even fail. In still other settings, reliability may emerge as a key determinant in the marketplace: the more reliable product, at a comparable price, may simply displace the less reliable one. Reliable distributed computing suddenly has broad relevance.
Throughout what follows, the term "distributed computing" is used to describe a type of computer system that differs from what could be called a "network computing" system. The distinction illuminates the basic issues with which we will be concerned.
As we use the term here, a computer network is a communication technology supporting the exchange of messages among computer programs executing on computational nodes. Computer networks are data movers, providing capabilities for sending data from one location to another, dealing with mobility and with changing topology, and automating the division of available bandwidth among contending users. Computer networks have evolved over a twenty year period, and during the mid 1990’s network connectivity between computer systems became pervasive. Network bandwidth has also increased enormously, rising from hundreds of bytes per second in the early 1980’s to millions per second in the mid 1990’s, with gigabit rates anticipated in the late 1990’s and beyond.
Network functionality evolved steadily during this period. Early use of networks was entirely for file transfer, remote login and electronic mail or news. Over time, however, the expectations of users and the tools available have changed. The network user in 1996 is likely to be familiar with interactive network browsing tools such as Netscape’s browsing tool, which permits the user to wander within a huge and interconnected network of multimedia information and documents. Tools such as these permit the user to conceive of a computer workstation as a window into an immense world of information, accessible using a great variety of search tools, easy to display and print, and linked to other relevant material that may be physically stored halfway around the world and yet accessible at the click of a mouse.
Meanwhile, new types of networking hardware have emerged. The first generation of networks was built using point-to-point connections; to present the illusion of full connectivity to users, the network included a software layer for routing and connection management. Over time, these initial technologies were largely replaced by high speed long distance lines that route through various hubs, coupled to local area networks implemented using multiple access technologies such as Ethernet and FDDI: hardware in which a single "wire" has a large number of computers attached to it, supporting the abstraction of a shared message bus. At the time of this writing, a third generation of technologies is reaching the market, such as ATM hardware capable of supporting gigabit communication rates over virtual circuits, mobile connection technologies for the office that will allow computers to be moved without rewiring, and more ambitious mobile computing devices that exploit the nationwide cellular telephone grid for communications support.
As recently as the early 1990’s, computer bandwidth over wide-area links was limited for most users. The average workstation had high speed access to a local network, and perhaps the local email system was connected to the Internet, but individual users (especially those working from PC’s) rarely had better than 1600 baud connections available for personal use of the Internet. This picture is changing rapidly today: more and more users have relatively high speed modem connections to an Internet service provider that offers megabyte-per-second connectivity to remote servers. With the emergence of ISDN services to the home, the last link of the chain will suddenly catch up with the rest. Individual connectivity has thus jumped from 1600 baud to perhaps 28,800 baud at the time of this writing, and may jump to 1 Mbaud or more in the not distant future. Moreover, this bandwidth has finally reached the PC community, which enormously outnumbers the workstation community.
It has been suggested that technology revolutions are often spurred by discontinuous, as opposed to evolutionary, improvement in a key aspect of a technology. The bandwidth improvements we are now experiencing are so disproportionate with respect to other performance changes (memory sizes, processor speeds) as to fall squarely into the discontinuous end of the spectrum. The sudden connectivity available to PC users is similarly disproportionate to anything in prior experience. The Web is perhaps just the first of a new generation of communications-oriented technologies enabled by these sudden developments.
In particular, the key enablers for the Web were precisely the availability of adequate long-distance communications bandwidth to sustain its programming model, coupled to the evolution of computing systems supporting high performance graphical displays and sophisticated local applications dedicated to the user. It is only recently that these pieces fell into place. Indeed, the Web emerged more or less as early as it could possibly have done so, considering the state of the art in the various technologies on which it depends. Thus while the Web is clearly a breakthrough ¾ the "killer application" of the Internet ¾ it is also the most visible manifestation of a variety of underlying developments that are also enabling other kinds of distributed applications. It makes sense to see the Web as the tip of an iceberg: a paradigm for something much broader that is sweeping the entire computing community.
As the trend towards better communication performance and lower latencies continues, it is certain to fuel continued growth in distributed computing. In contrast to a computer network, a distributed computing system refers to computing systems and applications that cooperate to coordinate actions at multiple locations in a network. Rather than adopting a perspective in which conventional (non-distributed) application programs access data remotely over a network, a distributed system includes multiple application programs that communicate over the network, but take actions at the multiple places where the application runs. Despite the widespread availability of networking since early 1980, distributed computing has only become common in the 1990’s. This lag reflects a fundamental issue: distributed computing turns out to be much harder than non-distributed or network computing applications, especially if reliability is a critical requirement.
Our treatment explores the technology of distributed computing with a particular bias: to understand why the emerging generation of critical Internet and Web technologies is likely to require very high levels of reliability, and to explore the implications of this for distributed computing technologies. A key issue is to gain some insight into the factors that make it so hard to develop distributed computing systems that can be relied upon in critical settings, and and to understand can be done to simplify the task. In other disciplines like civil engineering or electrical engineering, a substantial body of practical development rules exists that the designer of a complex system can draw upon to simplify his task. It is rarely necessary for the firm that builds a bridge to engage in theoretical analyses of stress or basic properties of the materials used, because the theory in these areas was long-ago reduced to collections of practical rules and formulae that the practitioner can treat as tools for solving practical problems.
This observation motivated the choice of the cover of the book. The Golden Gate Bridge is a marvel of civil engineering that reflects a very sophisticated understanding of the science of bridge-building. Although located in a seismically active area, the bridge is believed capable of withstanding even an extremely severe earthquake. It is routinely exposed to violent winter storms: it may sway but is never seriously threatened. And yet the bridge is also esthetically pleasing: one of the truely beautiful constructions of its era. Watching the sun set over the bridge from Berkeley, where I attended graduate school, remains among the most memorable experiences of my life. The bridge illustrates that beauty can also be resilient: a fortunate development, since otherwise, the failure of the Tacoma Narrows bridge might have ushered in a generation of bulky and overengineered bridges. The achievement of the Golden Gate bridge illustrates that even when engineers are confronted with extremely demanding standards, it is possible to achieve solutions that are elegant and lovely at the same time as they are resilient. This is only possible, however, to the degree that there exists an engineering science of robust bridge building.
We can build distributed computing systems that are reliable in this sense, too. Such systems would be secure, trustworthy, and would guarantee availability and consistency even when limited numbers of failures occur. Hopefully, these limits can be selected to provide adequate reliability without excessive cost. In this manner, just as the science of bridge-building has yielded elegant and robust bridges, reliability need not compromise elegance and performance in distributed computing.
One could argue that in distributed computing, we are today building the software bridges of the Information Superhighway. Yet in contrast to the disciplined engineering that enabled the Golden Gate Bridge, as one explores the underlying technology of the Internet and the Web one discovers a disturbing and pervasive inattention to issues of reliability. It is common to read that the Internet (developed originally by the Defense Department’s Advanced Research Projects Agency, ARPA) was built to withstand a nuclear war. Today, we need to adopt a similar mindset as we extend these networks into systems that must support tens or hundreds of millions of Web users, and a growing number of hackers whose objectives vary from the annoying to the criminal. We will see that many of the fundamental technologies of the Internet and Web fundamental assumptions that, although completely reasonable in the early days of the Internet’s development, have now started to limit scalability and reliability, and that the infrastructure is consequently exhibiting troubling signs of stress.
One of the major challenges, of course, is that use of the Internet has begun to expand so rapidly that the researchers most actively involved in extending its protocols and enhancing its capabilities are forced to work incrementally: only limited changes to the technology base can be contemplated, and even small upgrades can have very complex implications. Moreover, upgrading the technologies used in the Internet is somewhat like changing the engines on an airplane while it is flying. Jointly, these issues limit the ability of the Internet community to move to a more reliable, secure, and scalable architecture. They create a background against which the goals of this textbook will not easily be achieved.
In early 1995, the author was invited by ARPA to participate in an unclassified study of the survability of distributed systems. Participants included academic experts and invited experts familiar with the state of the art in such areas as telecommunications, power systems management, and banking. This study was undertaken against a backdrop colored by the recent difficulties of the Federal Aviation Agency, which launched a project in the late 1980’s and early 1990’s to develop a new generation of highly reliable distributed air traffic control software. Late in 1994, after losing a huge sum of money and essentially eliminating all distributed aspects of an architecture that was originally innovative precisely for its distributed reliability features, a prototype of the proposed new system was finally delivered, but with such limited functionality that planning on yet another new generation of software had to begin immediately. Meanwhile, article after article in the national press reported on failures of air-traffic control systems, many stemming from software problems, and several exposing airplanes and passengers to extremely dangerous conditions. Such an situation can only inspire the utmost concern in regard to the practical state of the art.
Although our study did not focus on the FAA’s specific experience, the areas we did study are in many ways equally critical. What we learned is that situation encountered by the FAA’s highly visible project is occuring, to a greater or lesser degree, within all of these domains. The pattern is one in which pressure to innovate and introduce new forms of products leads to the increasingly ambitious use of distributed computing systems. These new systems rapidly become critical to the enterprise that developed them: too many interlocked decisions must be made to permit such steps to be reversed. Responding to the pressures of timetables and the need to demonstrate new functionality, engineers inevitably postpone considerations of availability, security, consistency, system management, fault-tolerance ¾ what we call "reliability" in this text ¾ until "late in the game," only to find that it is then very hard to retrofit the necessary technologies into what has become an enormously complex system. Yet, when pressed on these issues, many engineers respond that they are merely following common practice: that their systems use the "best generally accepted engineering practice" and are neither more nor less robust than the other technologies used in the same settings.
Our group was very knowledgeable about the state of the art in research on reliability. So, we often asked our experts whether the development teams in their area are aware of one result or another in the field. What we learned was that research on reliability has often stopped too early to impact the intended consumers of the technologies we developed. It is common for work on reliability to stop after a paper or two and perhaps a splashy demonstration of how a technology can work. But such a proof of concept often leaves open the question of how the reliability technology can interoperate with the software development tools and environments that have become common in industry. This represents a serious obstacle to the ultimate use of the technique, because commercial software developers necessarily work with commercial development products and seek to conform to industry standards.
This creates a quandry: one cannot expect a researcher to build a better version of a modern operating system or communications architecture: such tasks are enormous and even very large companies have difficulty successfully concluding them. So it is hardly surprising that research results are demonstrated on a small scale. Thus, if industry is not eager to exploit the best ideas in an area like reliability, there is no organization capable of accomplishing the necessary technology transition.
For example, we will look at an object-oriented technology called the Common Object Request Broker Architecture, or CORBA, which has become extremely popular. CORBA is a structural methodology: a set of rules for designing and building distributed systems so that they will be explicitly described, easily managed, and so that components can be interconnected as easily as possible. One would expect that researchers on security, fault-tolerance, consistency, and other properties would embrace such architectures, because they are highly regular and designed to be extensible: adding a reliability property to a CORBA application should be a very natural step. However, relatively few researchers have looked at the specific issues that arise in adapting their results to a CORBA setting (we’ll hear about some of the ones that have). Meanwhile, the CORBA community has placed early emphasis on performance and interoperability, while reliability issues have been dealt with primarily by individual vendors (although, again, we’ll hear about some products that represent exceptions to the rule). What is troubling is the sense of "disconnection" between the reliability community and its most likely users, and the implication that reliability is not accorded a very high value by the vendors of distributed systems products today.
Our study contributed towards a decision by the DoD to expand its investment in research on technologies for building practical, survivable, distributed systems. This DoD effort will focus both on developing new technologies for implementing survivable systems, and on developing new approaches to hardening systems built using conventional distributed programming methodologies, and it could make a big difference. But one can also use the perspective gained through a study such as this one to look back over the existing state of the art, asking to what degree the technologies we already have "in hand" can, in fact, be applied to the critical computing systems that are already being developed.
As it happened, I started work on this book during the period when this DoD study was underway, and the presentation that follows is strongly colored by the perspective that emerged from it. Indeed, the study has considerably impacted my own research project. I’ve come to the personal conclusion is that the situation could be much better if developers were simply to begin to think hard about reliability, and had greater familiarity with the techniques at their disposal today. There may not be any magic formulas that will effortlessly confer reliability upon a distributed system, but at the same time, the technologies available to us are in many cases very powerful, and are frequently much more relevant to even off the shelf solutions than is generally recognized. We need more research on the issue, but we also need to try harder to incorporate what we already know how to do into the software development tools and environments on which the majority of distributed computing applications are now based. This said, it is also clear that researchers will need to start paying more attention to the issues that arise in moving their ideas from the laboratory to the field.
Lest these comments seem to suggest that the solution is in hand, it must be understood that there are intangible obstacles to reliability that seem very subtle and yet rather pervasive. Above, it was commented that the Internet and Web is in some ways "fundamentally" unreliable, and that industry routinely treats reliability as a secondary consideration, to be addressed only in mature products and primarily in a "fire fighting" mode, for example after a popular technology is somehow compromised by hackers in a visible way. Neither of these will be easy problems to fix, and they combine to have far-reaching implications. Major standards have repeatedly defered consideration of reliability issues and security until "future releases" of the standards documents or prototype platforms. The message sent to developers is clear: should they wish to build a reliable distributed system, they will need to overcome tremendous obstacles, both internal to their companies and in the search for enabling technologies, and will find relatively little support from the vendors who sell standard computing platforms.
The picture is not uniformly grim, of course. The company I founded in 1988, Isis Distributed Systems, is one of a handful of small technology sources that do offer reliability solutions, often capable of being introduced very transparently into existing applications. (Isis now operates as a division of Stratus Computers Inc., and my own role is limited to occassional consulting). Isis is quite successful, as are many of these companies, and it would be wrong to say that there is no interest in reliability. But these isolated successes are in fact the small story. The big story is that reliability has yet to make much of a dent on the distributed computing market.
The approach of this book is to treat distributed computing technology in a uniform way, looking at the technologies used in developing Internet and Web applications, at emerging standards such as CORBA, and at the technologies available to us for building reliable solutions within these settings. Many texts that set this goal would do so primarily through a treatment of the underlying theory, but our approach here is much more pragmatic. By and large, we treat the theory as a source of background information that one should be aware of, but not as the major objective. Our focus, rather, is to understand how and why practical software tools for reliable distributed programming work, and to understand how they can be brought to bear on the broad area of technology currently identified with the Internet and the Web. By building up models of how distributed systems execute and using these to prove properties of distributed communication protocols, we will show how computing systems of this sort can be formalized and reasoned about, but the treatment is consistently driven by the practical implications of our results.
One of the most serious concerns about building reliable distributed systems stems from more basic issues that would underly any form of software reliability. Through decades of experience, it has become clear that software reliability is a process, not a property. One can talk about design practices that reduce errors, protocols that reconfigure systems to exclude faulty components, testing and quality assurance methods that lead to increased confidence in the correctness of software, and basic design techniques that tend to limit the impact of failures and prevent them from propagating. All of these improve the reliability of a software system, and so presumably would also increase the reliability of a distributed software system. Unfortunately, however, no degree of process ever leads to more than empirical confidence in the reliability of a software system. Thus, even in the case of a non-distributed system, it is hard to say "system X guarantees reliability property Y" in a rigorous way. This same limitation extends to distributed settings, but is made even worse by the lack of a process comparable to the one used in conventional systems. Significant advances are needed in the process of developing reliable distributed computing systems, in the metrics by which we characterize reliability, the models we use to predict their behavior in "new" configurations reflecting changing loads or failures, and in the formal methods used to establish that a system satisfies its reliability goals.
For certain types of applications, this creates a profound quandary. Consider the design of an air traffic control software system, which (among other services) provides air traffic controllers with information about the status of air traffic sectors (Figure I-1). Web sophisticates may want to think of this system as one that provides a web-like interface to a database of routing information maintained on a server. Thus, the controller would be presented with a depiction of the air traffic situation, with push-button style interfaces or other case-specific interfaces providing access to additional information about flights, projected tragectories, possible options for rerouting a flight, and so forth. To the air traffic controller these are the commands supported by the system; the web user might think of them as active hyperlinks. Indeed, even if air traffic control systems are not typical of what the Web is likely to support, other equally critical applications are already moving to the Web, using very much the same "programming model."
A controller who depends upon a system such as this needs an absolute assurance that if the service reports that a sector is available and a plane can be routed into it, this information is correct and that no other controller has been given the same information in regard to routing some other plane. An optimization criteria for such a service would be that it minimize the frequency with which it reports a sector as being occupied when it is actually free. A fault-tolerance goal would be that the service remain operational despite limited numbers of failures of component programs, and perhaps that it perform self-checking operations so as to take a component off-line if it somehow falls out of synchronization with regard to the states of other components. Such goals would avoid scenarios such as the one illustrated in Figure I-2, where the system state has become dangerously inconsistent as a result of a network failure that fools some clients into thinking the primary has failed, and similarly fools the primary and backup into mutually believing one-another to have crashed.
Now, suppose that the techniques of this book were used to construct such a service, using the best available technological solutions, combined with rigorous formal specifications of the software components involved, and the best possible quality process. Theoretical results assure us that inconsistencies such as the one in Figure I-2 cannot arise. Years of testing might yield a very high degree of confidence in the system, yet the service remains a large, complex software artifact. Even minor changes to the system, to add a feature, correct a very simple bug, or to upgrade the operating system version or hardware, could introduce serious problems long after the system was put into production. The question then becomes: can complex software systems ever be used in critical settings? If so, are distributed systems somehow "worse", or are the issues similar?
At the core of the material treated in this book is the consideration seen in this question. There may not be a single answer: distributed systems are suitable for some critical applications and ill-suited for others. In effect, although one can build "reliable distributed software," reliability has its limits and there are problems that distributed software should probably not be used to solve. Even given an appropriate technology, it is easy to build inappropriate solutions – and, conversely, even with an inadequate technology, one can sometimes build critical services that are still useful in limited ways. The air traffic example, described above, might or might not fall into the feasible category, depending on the detailed specification of the system, the techniques used to implement the solution, and the overall process by which the result is used and maintained.
Through the material in this book, the developer will be guided to appropriate design decisions, appropriate development methodologies, and to an understanding of the reliability limits on the solutions that result from this process. No book can expect to instill the sense of responsibility that the reader may need to draw upon in order to make such decisions wisely, but one hopes that computer systems engineers, like bridge builders and designers of aircraft, are highly motivated to build the best and most reliable systems possible. Given such a motivation, an appropriate development methodology, and appropriate software tools, extremely reliable distributed software can be implemented and deployed even into critical settings. We will see precisely how this can be done in the chapters that follow.
Perhaps this book can serve a second purpose in accomplishing its primary one. Many highly placed industry leaders have commented to me that until reliability is forced upon them, their companies will never take the issues involved seriously. The investment needed is simply viewed as very large, and likely to slow the frantic rate of progress on which computing as an industry has come to depend. I believe that the tide is now turning in a way that will, in fact, force change, and that this text can contribute to what will, over time, become an overwhelming priority for the industry.
Reliability is viewed as complex and costly, much as the phrase "robust bridge" conjures up a vision of a massive, expensive, and ugly artifact. Yet, the Golden Gate Bridge is robust and is anything but massive or ugly. To overcome this instinctive reaction, it will be necessary for the industry to come to understand reliability as being compatible with performance, elegance, and market success. At the same time, it will be important for pressure favoring reliability to grow, through demand by the consumers for more reliable products. Jointly, such trends would create an incentive for reliable distributed software engineering, while removing a disincentive.
As the general level of demonstrated knowledge concerning how to make systems reliable rises, the expectation of society and government that vendors will employ such technologies is, in fact, likely to rise. It will become harder and harder for corporations to cut corners by bringing an unreliable product to market and yet advertising it as "fault-tolerant", "secure", or otherwise "reliable". Today, these terms are often used in advertising for products that are not reliable in any meaningful sense at all. One might similarly claim that a building or a bridge was constructed "above code" in a setting where the building code is completely ad-hoc. The situation changes considerably when the building code is made more explicit and demanding, and bridges and buildings that satisify the standard have actually been built successfully (and, perhaps, elegantly and without excessive added cost). In the first instance, a company can easily cut corners; in the second, the risks of doing so are greatly increased.
Moreover, at the time of this writing, vendors often seek to avoid software product liability using complex contracts that stipulate the unsuitability of their products for critical uses, the near certainty that their products will fail even if used correctly, and in which it is stressed that the customer accepts full responsibility for the eventual use of the technology. It seems likely that as such contracts are put to the test, many of them will be recognized as analogous to those used by a landlord who rents an dangerously deteriorated apartment to a tenant, using a contract that warns of the possibility that the kitchen floor could collapse without warning and that the building is a firetrap lacking adequate escape routes. A landlord could certainly draft such a contract and a tenant might well sign it. But if the landlord fails to maintain the building according to the general standards for a safe and secure dwelling, the courts would still find the landlord liable if the floor indeed collapses. One cannot easily escape the generally accepted standards for one’s domain of commercial activity.
By way of analogy, we may see growing pressure on vendors to recognize their fundamental responsibilities to provide a technology base adequate to the actual uses of their technologies, like it or not. Meanwhile, today a company that takes steps to provide reliability worries that in so doing, it may have raised expectations impossibly high and hence exposed itself to litigation if its products fail. As reliability becomes more and more common, such a company will be protected by having used the best available engineering practices to build the most reliable product that it was capable of producing. If such a technology does fail, one at least knows that it was not the consequence of some outrageous form of negligence. Viewed in these terms, many of the products on the market today are seriously deficient. Rather than believing it safer to confront a reliability issue using the best practices available, many companies feel that they run a lower risk by ignoring the issue and drafting evasive contracts that hold themselves harmless in the event of accidents.
The challenge of reliability, in distributed computing, is perhaps the unavoidable challenge of the coming decade, just as performance was the challenge of the past one. By accepting this challenge, we also gain new opportunities, new commercial markets, and help create a future in which technology is used responsibly for the broad benefit of society. There will inevitably be real limits on the reliability of the distributed systems we can build, and consequently there will be types of distributed computing systems that should not be built because we cannot expect to make them adequately reliable. However, we are far from those limits, and are in many circumstances deploying technologies known to be fragile in ways that actively encourage their use in critical settings. Ignoring this issue, as occurs too often today, is irresponsible and dangerous, and increasingly unacceptable. Reliability challenges us as a community: it falls upon us now to respond.
A User’s Guide to This Book
This book was written with several types of readers in mind, and consequently weaves together material that may be of greater interest to one type of reader with that aimed at another type of reader.
Practioners will find that the book has been constructed to be readable more or less sequentially from start to finish. The first part of the book may well be familiar material to many practitioners, but we try to approach this a perspective of understanding reliability and consistency issues that arise even when using the standard distributed systems technologies. We also look at the important roles of performance and modularity in building distributed software that can be relied upon. The second part of the book, which focuses on the Web, is of a similar character. Even if experts in this area may be surprised by some of the subtle reliability and consistency issues associated with the Web, and may find the suggested solutions useful in their work.
The third part of the book looks squarely at reliability technologies. Here, a pragmatically-oriented reader may want to skim through Chapters 13 through 16, which get into the details of some fairly complex protocols and programming models. This material is included for thoroughness, and I don’t think it is exceptionally hard to understand. However, the developer of a reliable system doesn’t necessarily need to know every detail of how the underlying protocols work, or how they are positioned relative to some of the theoretical arguments of the decade! The remainder of the book can be read without having worked through these chapters in any great detail. Chapters 17 and 18 look at the uses of these "tools" through an approach based on what are called wrappers, however, and chapters 19-24 look at some related issues concerning such topics as real-time systems, security, persistent data, and system management. The content is practical and the material is intended to be of a hands-on nature. Thus, the text is designed to be read more or less in order by this type of systems developer, with the exception of those parts of Chapters 13 through 16 where the going gets a bit heavy.
Where possible, the text includes general background material: there is a section on ATM networks, for example, that could be read independently of the remainder of the text, one on Corba, one on message-oriented middleware, and so forth. As much as practical, I have tried to make these sections free-standing and to index them properly, so that if one were worried about security exposures of the NFS file system, for example, it would be easy to read about that specific topic without reading the entire book as well. Hopefully, practitioners will find this text useful as a general reference for the technologies covered, and not purely for its recommendations in the area of security and reliability.
Next, some comments directed towards other researchers and instructors who may read or chose to teach from this text. I based the original outline of this treatment on a course that I have taught several times at Cornell, to a mixture of 4’th year undergraduates, professional Master’s degree students, and 1’st year Ph.D. students. To facilitate the development of course materials, I have placed my slides (created using the Microsoft PowerPoint utility) on Cornell University’s public file server, where they can be retrieved using FTP. (Copy the files from ftp.cs.cornell.edu/pub/ken/slides). The text also includes a set of problems that can be viewed either as thought-provoking exercizes for the professional who wishes to test his or her own understanding of the material, or as the basis for possible homework and course projects in a classroom setting.
Any course based on this text should adopt the same practical perspective as the text itself. I suspect that some of my research colleagues will consider the treatment broad but somewhat superficial; this reflects a decision by the author to focus primarily on "systems" issues, rather than on theory or exhaustive detail on any particular topic. In making this decision, compromises had to be accepted: when teaching from this text, it may be necessary to also ask the students to read some of the more technically complete papers that are cited in subsections of interest to the instructor, and to look in greater detail at some of the systems that are are mentioned only briefly here. On the positive side, however, there are few, if any, introductory distributed systems textbooks that try to provide a genuinely broad perspective on issues in reliability. In the author’s experience, many students are interested in this kind of material today, and having gained a general exposure, would then be motivated to attend a much more theoretical course focused on fundamental issues in distributed systems theory. Thus, while this textbook may not be sufficient in and of itself for launching a research effort in distributed computing, it could well serve as a foundation for such an activity.
It should also be noted that, in my own experience, the book long for a typical 12-week semester. Instructors who elect to teach from it should be selective about the material that will be covered, particularly if they intend to treat chapters 13-17 in any detail. If one has the option of teaching over two semesters, it might make sense to split the course into two parts and to include supplemental material on the Web. I suspect that such a sequence would be very popular given the current interest in network technology. At Cornell, for example, I tend to split this material into a more practical course that I teach in the fall, aiming at our professional master’s degree students, followed by a more probing advanced graduate course that I or one of my colleagues teach in the spring, drawing primarily on the original research papers associated with the topics we cover. This works well for us at Cornell, and the organization and focus of the book match with such a sequence.
A final comment regarding references. To avoid encumbering the discussion with a high density of references, the book cites relevant work the first time a reference to it arises in the text, or where the discussion needs to point to a specific paper, but may not do so in subsequent references to the same work. References are also collected at the end of each chapter into a short section on related readings. It is hard to do adequate justice to such a large and dynamic area of research with any limited number of citations, but every effort has been made to be fair and complete.
Part I: Basic Distributed Computing Technologies
Although our treatment is motivated by the emergence of the Global Information Superhighway and the World Wide Web, this first part of the book focuses on the general technologies on which any distributed computing system relies. We review basic communication options, and the basic software tools that have emerged for exploiting them and for simplifying the development of distributed applications. In the interests of generality, we cover more than just the specific technologies embodied in the Web as it exists at the time of this writing, and in fact terminology and concepts specific to the Web are not introduced until Part II of the book. However, even in this first part, we do discuss some of the most basic issues that arise in building reliable distributed systems, and we begin to establish the context within which reliability can be treated in a systematic manner.
Reduced to the simplest terms, a distributed computing system is a set of computer programs, executing on one or more computers, and coordinating actions by exchanging messages. A computer network is a collection of computers interconnected by hardware that directly supports message passing. Most distributed computing systems operate over computer networks, although this is not always the case: one can build a distributed computing system in which the components execute on a single multi-tasking computer, and one can also build distributed computing systems in which information flows between the components by means other than message passing. Moreover, as we will see in Chapter 24, there are new kinds of parallel computers, called "clustered" servers, that have many attributes of distributed systems despite appearing to the user as a single machine built using rack-mounted components.
We will use the term protocol in reference to an algorithm governing the exchange of messages, by which a collection of processes coordinate their actions and communicate information among themselves. Much as a program is the set of instructions, and a process denotes the execution of those instructions, a protocol is a set of instructions governing the communication in a distributed program, and a distributed computing system is the result of executing some collection of such protocols to coordinate the actions of a collection of processes in a network.
This textbook is concerned with reliability in distributed computing systems. Reliability is a very broad term that can have many meanings, including:
Underlying many of these issues are questions of tolerating failures. Failure, too, can have many meanings:
An even more basic issue underlies all of these: the meaning of computation, and the model one assumes for communication and coordination in a distributed system. Some examples of models include these:
A specific issue that will emerge as being particularly important when we consider guarantees of behavior in Part III of the text concerns the availability, or lack, of accurate temporal information. Until the late 1980’s. the clocks built into workstations were notoriously inaccurate, exhibiting high drift rates that had to be overcome with software protocols for clock resynchronization. There are limits on the quality of synchronization possible in software, and this created a substantial body of research and lead to a number of competing solutions. In the early 1990’s, however, the advent of satellite time sources as part of the global positioning system (GPS) changed the picture: for the price of an inexpensive radio receiver, any computer could obtain accurate temporal data, with resolution in the sub-millisecond range. The degree to which GPS recievers actually replace quartz-based time sources remains to be seen, however. Thus, real-world systems are notable (or notorious) in part for having temporal information, but of potentially low quality.
Normally, the synchronous model also assumes bounds on communication latency between processes, clock skew and precision, and other properties of the environment. As in the case of an asynchronous model, the synchronous one takes an extreme point of view because this simplifies reasoning about certain types of protocols. Real-world systems are not synchronous – it is impossible to build a system in which actions are perfectly coordinated as this model assumes. However, if one proves the impossibility of solving some problem in the synchronous model, or proves that some problem requires at least a certain number of messages in this model, one has established a sort of lower-bound. In a real-world system, things can only get worse, because we are limited to "weaker" assumptions. This makes the synchronous model a valuable tool for understanding how hard it will be to solve certain problems.
Reliable distributed computing systems are assembled from basic building blocks. In the simplest terms, these are just processes and messages, and if our interest was purely theoretical, it might be reasonable to stop at that. On the other hand, if we wish to apply theoretical results in practical systems, we will need to work from a fairly detailed "real" understanding of how practical systems actually work. In some ways, this is unfortunate, because real systems often include mechanisms that are deficient in ways that seem simple to fix, or inconsistent with one another, but have such a long history (or are so deeply embedded into standards) that there may be no way to "improve" on the behavior in question. Yet, if we want to actually build reliable distributed systems, it is unrealistic to insist that we will only do so in idealized environments that support some form of theoretically motivated structure. The real world is heavily committed to standards, and the task of translating our theoretical insights into practical tools that can interplay with these standards is probably the most important challenge faced by the computer systems engineer.
It is common to think of a distributed system as operating over a layered set of network services. Each layer corresponds to a software abstraction or hardware feature, and maybe implemented in the application program itself, in a library of procedures to which the program is linked, in the operating system, or even in the hardware of the communications device. As an illustration, here is the layering of the ISO Open Systems Interconnection (OSI) protocol model [Tan88,Com91,CS91,CS93,CDK94]:
The program using the communication connection
Software to encode application data into messages, and to decode on reception.
The logic associated with guaranteeing end-to-end properties such as reliability.
Software concerned with fragmenting big messages into small packets
Routing functionality, usually limited to small or fixed-size packets
The protocol used to represent packets on the wire
Table 1: ISO Protocol Layers
It is useful to distinguish the types of guarantees provided by the various layers as being end-to-end guarantees in the case of the session, presentation and application layer and point-to-point guarantees for layers below these. The distinction is important in complex networks where a message may need to traverse many links to reach its destination. In such settings, a point-to-point property is one that holds only on a per-hop basis: for example, the data-link protocol is concerned with a single hop taken by the message, but not with its overall route or the guarantees that the application may expect from the communication link itself. The session, presentation and application layers, in contrast, impose a more complex logical abstraction on the underlying network, with properties that hold between the end-points of a communication link that may physically extend over a complex substructure. In Part III of this textbook we will concern ourselves with increasingly elaborate end-to-end properties, until we finally extend these properties into an completely encompassing distributed communication abstraction that embraces the distributed system as a whole and provides consistent behavior and guarantees throughout. And, just as the ISO layering builds its end-to-end abstractions over point-to-point ones, so will we need to build these more sophisticated abstractions over what are ultimately point-to-point properties.
As seen in Figure 1-1, each layer is logically composed of transmission logic and the corresponding reception logic. In practice, this often corresponds closely to the implementation of the architecture: for example, most session protocols operate by imposing a multiple session abstraction over a shared (or "multiplexed")link-level connection. The packets generated by the various higher level session protocols can be conceived of as merging into a single stream of packets that are treated by the IP link level as a single "customer" for its services. Nonetheless, one should not necessarily assume that the implementation of a layered protocol architecture involves some sort of separate module for each layer. To maximize performance, the functionality of a layered architecture is often compressed into a single piece of software, and in some cases layers may be completely bypassed for types of messages where the layer would take no action – for example, if a message is very small, the OSI transport layer wouldn’t need to fragment it into multiple packets, and one could imagine an implementation of the OSI stack specialized for small messages, that omits the transport layer. Indeed, the pros and cons of layered protocol architecture have become a major topic of debate in recent years [CT87, AP93, KP93, KC94, BD95].
Although the OSI layering is probably the best known, the notion of layering communication software is pervasive, and there are many other examples of layered architectures and layered software systems. Later in this textbook we will see ways in which the OSI layering is outdated, because it doesn’t directly address multi-participant communication sessions and doesn’t match very well with some new types of communication hardware, such as asynchronous transfer-mode (ATM) switching systems. In discussing this point we will see that more appropriate layered architectures can be constructed, although they don’t match the OSI layering very closely. Thus, one can think of layering as a methodology, or layering as a very specific thing, matched to the particular layers of the OSI hierarchy. The former perspective is a popular one that is only gaining importance with the introduction of object-oriented distributed computing environments, which have a natural form of layering associated with object classes and subclasses. The later form of layering has probably become hopelessly incompatible with standard practice by the time of this writing, although many companies and governments continue to "require" that products comply with it.
(hardware bit level)
(hardware bit level)
Figure 1-1: Data flow in an ISO protocol stack. Each sending layer is invoked by the layer above it and passes data off to the layer below it, and conversely on the receive side. In a logical sense, however, each layer interacts with its peer on the remote side of the connection. For example, the sender-side session layer may add a header to a message that the receive-side session layer strips off.
Stepping back somewhat, it can be argued that a layered communication architecture is primarily valuable as a descriptive abstraction – a model that captures the essential functionality of a real communication system but doesn’t need to accurately reflect its implementation. The idea of abstracting the behavior of a distributed system in order to concisely describe it or to reason about it is a very important one. However, if the abstraction doesn’t accurately correspond to the implementation, this also creates a number of problems for the system designer, who now has the obligation to develop a specification and correctness proof for the abstraction, to implement, verify and test the corresponding software, and to undertake an additional analysis that confirms that the abstraction accurately models the implementation.
It is easy to see how this process can break down; for example, it is nearly inevitable that changes to the implementation will have to be made long after a system has been deployed. If the development process is genuinely this complex, it is likely that the analysis of overall correctness will not be repeated for every such change. Thus, from the perspective of a user, abstractions can be a two-edged sword. They offer appealing and often simplified ways to deal with a complex system, but they can also be simplistic or even incorrect. And this bears strongly on the overall theme of reliability. To some degree, the very process of cleaning up a component of a system in order to describe it concisely can compromise the reliability of a more complex system in which that component is used.
Throughout the remainder of this book, we will often have recourse to models and abstractions, in much more complex situations than the OSI layering. This will assist us in reasoning about and comparing protocols, and in proving properties of complex distributed systems. At the same time, however, we need to keep in mind that this whole approach demands a sort of "meta approach", namely a higher level of abstraction at which we can question the methodology itself, asking if the techniques by which we create reliable systems are themselves a possible source of unreliability. When this proves to be the case, we need to take the next step as well, asking what sorts of systematic remedies can be used to fight these types of reliability problems.
Can "well structured" distributed computing systems be built that can tolerate the failures of their own components? In layerings like the OSI one, this issue is not really addressed, which is one of the reasons that the OSI layering won’t work well for our purposes in this text. However, the question is among the most important ones that will need to be resolved if we want to claim that we have arrived at a workable methodology for engineering reliable distributed computing systems. A methodology, then, must address descriptive and structural issues as well as practical ones such as the protocols used to overcome a specific type of failure or to coordinate a specific type of interaction.
The most basic communications technology in any distributed system is the hardware support for message passing. Although there are some types of networks that offer special properties, most modern networks are designed to transmit data in packets with some fixed but small maximum size. Each packet consists of a header, which is a data structure containing information about the packet, its destination and route, etc. It contains a body, which is the bytes that make up the content of the packet. And it may contain a trailer, which is a second data structure that is physically transmitted after the header and body, and would normally consist of a checksum for the packet that the hardware computes and appends to it as part of the process of transmitting the packet.
When a user’s message is transmitted over a network, the packets actually sent ‘‘on the wire’’ include headers and trailers, and may have a fixed maximum size. Large messages are sent as multiple packets. For example, Figure 1-2 illustrates a message that has been fragmented into three packets, each containing a header and some part of the data from the original message. Not all fragmentation schemes include trailers, and in the figure no trailer is shown.
Modern communication hardware often permits large numbers of computers to share a single communication "fabric". For this reason, it is necessary to specify the address to which a message should be transmitted. The hardware used for communication therefore will normally support some form of addressing capability, by which the destination of a message can be identified. More important to most software developers, however, are addresses supported by the transport services available on most operating systems. These logical addresses are a representation of location within the network, and are used to route packets to their destinations. Each time a packet makes a "hop" over a communications link, the sending computer is expected to copy the hardware address of the next machine in the path into the outgoing packet. Within this textbook, we assume that each computer has a logical address, but will have little to say about hardware addresses.
On the other hand, there are two hardware addressing features that have important implications for higher level communication software. These are the ability of the hardware to broadcast and multicast messages
A broadcast is a way of sending a message so that it will be delivered to all computers that it reaches. This may not be all the computers in a network, because of the various factors that can cause a receive omission failure to occur, but for many purposes, absolute reliability is not required. To send a hardware broadcast, an application program generally places a special logical address in an outgoing message that the operating system maps to the appropriate hardware address. The message will only reach those machines connected to the hardware communications device on which the transmission occurs, so the use of this feature requires some knowledge of network communications topology.
A multicast is a form of broadcast that communicates to a subset of the computers that are attached to a communications network. To use a multicast, one normally starts by creating a new "multicast group address" and installing it into the hardware interfaces associated with a communications device. Multicast messages are then sent much as a broadcast would be, but are only accepted, at the hardware level, at those interfaces that have been instructed to install the group address to which the message is destined. Many network routing devices and protocols watch for multicast packets and will forward them automatically, but this is rarely attempted for broadcast packets.
Chapter 2 discusses some of the most common forms of communication hardware in detail.
The layer of software that runs over the communications layer is the one most distributed systems programmers deal with. This layer hides the properties of the communications hardware from the programmer. It provides the ability to send and receive messages that may be much larger than the ones supported by the underlying hardware (although there is normally still a limit, so that the amount of operating system buffering space needed for transport can be estimated and controlled). Th transport layer also implements logical addressing capabilities by which every computer in a complex network can be assigned a unique address, and can send and receive messages from every other computer.
Although many transport layers have been proposed, one set of standards has been adopted by almost all vendors. This standard defines the so-called "Internet Protocol" or IP protocol suite, and originated in a research network called the ARPANET that was developed by the U.S. government in the late 1970’s [Tan88,Com91,CDK94]. A competing standard was introduced by the ISO organization in association with the OSI layering cited earlier, but has not gained the sort of ubiquitous acceptance of the IP protocol suite, and there are additional proprietary standards that are widely used by individual vendors or industry groups, but rarely seen outside their community. For example, most PC networks support a protocol called NETBIOS, but this protocol is not common in any other type of computing environment.
Transport services generally offer at least the features of the underlying communication hardware. Thus, the most widely used communication services include a way to send a message to a destination, to broadcast a message, and to multicast a message. Unlike the communications hardware versions of these services, however, transport-layer interfaces tend to work with logical addresses and to automatically route messages within complex environments that may mix multiple forms of communication hardware, or include multiple communication subnetworks bridged by routing devices or computers.
All of this is controlled using routing tables, like the one shown below. A routing table is a data structure local to each computer in a network – each computer has one, but the contents will generally not be identical from machine to machine. The table is indexed by the logical address of a destination computer, and entries contain the hardware device on which messages should be transmitted (the "next hop" to take). Distributed protocols for dynamically maintaining routing tables have been studied for many years, and seek to minimize the number of hops a message needs to take to reach its destination but to also spread load evenly and route around failures or congested nodes. In practice, however, static routing tables are probably more common: these are maintained by a system administrator for the network and generally offer a single route from a source to each destination. Chapter 3 discusses some of the most common transport services in more detail.
Outgoing link 1
Outgoing link 2
Outgoing link 2
Outgoing link 2
Figure -5: A sample routing table, such as might be the one used by computer 188.8.131.52 in Figure 1-4.
A limitation of the basic message passing services discussed in Section 1.2.2 is that they operate at the level of individual messages, and provide no guarantees of reliability. Messages can be lost for many reasons, including link failures, failures of intermediate machines on a complex multi-hop route, noise that causes corruption of the data in a packet, lack of buffering space (the most common cause), and so forth. For this reason, it is common to layer a reliability protocol over the message passing layer of a distributed communication architecture. The result is called a reliable communication channel. This layer of software is the one that the OSI stack calls the "session layer", and corresponds to the TCP protocol of the Internet. UNIX programmers may be more familiar with the notion from their use of "pipes" and "streams" [Rit84].
The protocol implementing a reliable communication channel will typically guarantee that lost messages will be retransmitted and that out-of-order messages will be resequenced and delivered in the order sent. Flow control and mechanisms that choke back the sender when data volume becomes excessive are also common in protocols for reliable transport [Jac88]. Just as the lower layers can support one-to-one, broadcast and multicast communication, these forms of destination addressing are also potentially interesting in reliable transport layers. Moreover, some systems go further and introduce additional reliability properties at this level, such as authentication (a trusted mechanism for verifying the identity of the processes at the ends of a communication connection), or security (trusted mechanisms for concealing the data transmitted over a channel from processes other than the intended destinations). In Chapter 3 we will begin to discuss these options, as well as some very subtle issues concerned with how and when connections report failure.
The most interesting issues that we will consider in this textbook are those relating to programming environments and tools that live in the middle, between the application program and the communications infrastructure for basic message passing and support for reliable channels.
Examples of important middleware services include the naming service, the file system, the time service, and the security "key" services used for authentication in distributed systems. We will be looking at all of these in more detail below, but we review them briefly here for clarity.
A naming service is a collection of user-accessible directories that map from application names (or other selection criteria)to network addresses of computers or programs. Name services can play many roles in a distributed system, and represent an area of intense research interest and rapid evolution. When we discuss naming, we’ll see that the whole question of what a name "represents" is itself subject to considerable debate, and raises important questions about notions of abstraction and services in distributed computing environments. Reliability in a name service involves issues such as trust – can one trust the name service to truthfully map a name to the correct network address? How can one know that the object at the end of an address is the same one that the name service was talking about? These are fascinating issues, and we will have a lot to say about them later (see, for example, Section 7.2).
From the outset, though, the reader may want to consider that if an intruder breaks into a system and is able to manipulate the mapping of names to network addresses, it will be possible to interpose all sorts of "snooping" software components in the path of communication from an application to the services it is using over the network. Such attacks are now common on the Internet and reflect a fundamental issue, which is that most network reliability technologies tend to trust the lowest level mechanisms that map from names to addresses and that route messages to the correct host when given a destination address.
A time service is a mechanism for keeping the clocks on a set of computers closely synchronized and close to "real time". Time services work to overcome the inaccuracy of inexpensive clocks used on many types of computers, and are important in applications that either coordinate actions using real-time, or that make use of time for other purposes, such as to limit the lifetime of a cryptographic key or to timestamp files when they are updated. Much can be said about time in a distributed system, and we will spend a considerable portion of this textbook on issues that revolve around the whole notion of "before" and "after" and their relation to intuitive notions of time in the real world. Clearly, the reliability of a time service will have important implications for the reliability of applications that make use of time, so time services and associated reliability properties will prove to be important in many parts of this textbook.
Authentication services are, perhaps surprisingly, a new technology that is lacking in most distributed computing environments. These services provide trustworthy mechanisms for determining who sent a message, for making sure that the message can only be read by the intended destination, and for restricting access to private data so that only authorized access can occur. Most modern computing systems evolved from a period when access control was informal and based on a core principle of trust among users. One of the really serious implications is that distributed systems that want to superimpose a security or protection architecture on a heterogeneous environment must overcome a pervasive tendency to accept requests without questioning them, to believe the user-id information including in messages without validating it, and to route messages wherever they may wish to go.
If banks worked this way, one could walk up to a teller in a bank that one had never visited before and pass that person a piece of paper requesting a list of individuals that have accounts in the branch. Upon studying the response and learning that "W. Gates" is listed, one could then fill out an account balance request in the name of W. Gates, asking how much money is in that account. And after this, one could withdraw some of that money, up to the bank’s policy limits. At no stage would one be challenged: the identification on the various slips of paper would be trusted for each operation. Such a world model may seem bizarrely trusting, but it is the model from which modern distributed computing systems emerged.
An important topic around which much of this book is oriented concerns the development of general purpose tools from which specialized distributed systems can be constructed. Such tools can take many forms, ranging from the purely conceptual – for example, a methodology or theory that offers useful insight into the best way to solve a problem or that can help the developer confirm that a proposed solution will have a desired property. A tool can offer practical help at a very low level, for example by eliminating the relatively mechanical steps required to encode the arguments for a remote procedure call into a message to the server that will perform the action. A tool can embody complex higher level behavior, such as a protocol for performing some action or overcoming some class of errors. Tools can even go beyond this, taking the next step by offering mechanisms to control and manage software built using other tools.
It has become popular to talk about distributed systems that support distributed operating environments – well integrated collections of tools that can be used in conjunction with one another to carry out potentially complex distributed programming tasks. Examples of distributed programming environments are the Open Network Computing (ONC) environment of SUN Microsystems, The Distributed Computing (DCE) of Open Software Foundation, the various CORBA-compliant programming tools that have become popular among C++ programmers who work in distributed settings, and the Isis Toolkit and the Horus system; these last two being systems developed by the author of this text and his colleagues, which will be discussed in Chapter 18.
Distributed systems architectures undertake to step even beyond the notion of a distributed computing environment. An architecture is a general set of design principles and implementation standards by which a collection of "compliant" systems can be developed. In principle, multiple systems that implement the same architecture will interoperate, so that if vendors implement competing solutions, the resulting software can still be combined into a single system with components that might even be able to communicate and cooperate with one another. The Common Request Broker, or CORBA, is probably the best known distributed computing architecture; it is useful for building systems using an object-oriented approach in which the systems are developed as modules that cooperate. Thus, CORBA is an architecture, and the various CORBA-based products that comply with the architecture are distributed computing environments.
One might expect that the "end of the line" for a layered distributed systems architecture would be the application level, but this is not necessarily the case. A distributed application might also be some sort of operating system service built over the communications tools that we have been discussing. For example, the distributed file system is an application in the sense of the OSI layering, but the user of a computing system might think of the file system as an operating system service over which applications can be defined and executed. Within the OSI layering, then, an application is any free-standing solution to a well defined problem that presents something other than a communications abstraction to its users. The distributed file system is just one example among many. Others include message bus technologies, distributed database systems, electronic mail, network bulletin boards, and the World-Wide-Web. In the near future, computer supported collaborative work systems and multimedia digital library systems are likely to emerge as further examples in this area.
A limitation of a layering like the OSI hierarchy is that it doesn’t really distinguish these sorts of applications, which provide services to higher level distributed applications, from what might be called end-user solutions, namely programs that operate over the communications layer to directly implement commands for a human being. One would like to believe that there is much more structure to a distributed air traffic control system than to a file transfer program, yet the OSI hierarchy views both as examples of ‘‘applications.’’ We lack a good classification system for the various types of distributed applications.
In fact, even complex distributed applications may merely be components of even larger-scale distributed systems – one can easily imagine a distributed system that uses a distributed computing toolkit to integrate an application that exploits distributed files with one that stores information into a distributed database. In an air-traffic control environment, availability may be so critical that one is compelled to run multiple copies of the software concurrently, with one version backing up the other. Here, the entire air traffic control system is at one level a complex distributed application in its own right, but at a different ‘‘meta’’ level, is just a component of an over-arching reliability structure visible on a scale of hundreds of computers located within multiple air traffic centers.
One of the major challenges to building reliable distributed systems is that computer networks have evolved to have a great many "dependencies" on a variety of technologies. Some of the major ones are identified in Figure 1-6, however the set is growing steadily and this figure is not necessarily complete. Critical applications often introduce new servers and critical components not shown here, nor does the figure treat dependencies on hardware components of the distributed infrastructure, such as the communication network itself, power supply, or hardware routers. Moreover, the telecommunications infrastructure underlying a typical network application is itself a complex network with many of the same dependencies internal to itself, together with additional ones such as the databases used to resolve mobile telephone numbers or to correctly account for use of network communication lines.
Fortunately, many of these services are fairly reliable, and one can plan around potential outages of such critical services as the network information service. The key issue is to understand the technology dependencies that can impact reliability issues for a specific application and to program solutions into the network to detect and work around potential outages. In this textbook we will be studying technical options for taking such steps. The emergence of integrated environments for reliable distributed computing will, however, require a substantial effort from the vendors offering the component technologies: an approach in which reliability is left to the application inevitably overlooks the problems that can be caused when such applications are forced to depend upon technologies that are themselves unreliable for reasons beyond the control of the developer.
While distributed systems are certainly layered, Figure 1-6 makes it clear that one should question the adequacy of any simple layering model for describing reliable distributed systems. We noted, for example, that many governments have mandated the use of the ISO layering for description of distributed software. Yet there are important reliability technologies that require structures inexpressible in this layering, and it is unlikely that those governments intended to preclude the use of reliable technologies. More broadly, the sorts of complex layerings that can result when tools are used to support applications that are in turn tools for still higher level applications are not amenable to any simple description of this nature. Does this mean that users should refuse the resulting complex software structures, because they cannot be described in terms of the standard? Should they accept the perspective that software should be used but not described, because the description methodologies seem to have lagged the state of the art? Or should governments insist on new standards each time a new type of system finds it useful to step outside of the standard?
Questions such as these may seem narrow and almost pointless, yet they point to a deep problem. Certainly, if we are unable to even describe complex distributed systems in a uniform way, it will be very difficult to develop a methodology within which one can reason about them and prove that they respect desired properties. On the other hand, if a standard proves unwieldy and constraining, it will eventually become difficult for systems to adhere to it.
Perhaps for these reasons, there has been little recent work on layering in the precise sense of the ISO hierarchy: most researchers view this as an unpromising direction. Instead, the notions of structure and hierarchy seen in ISO have reemerged in much more general and flexible ways: the object class hierarchies supported by technologies in the CORBA framework, the layered protocol stacks supported in operating systems like UNIX or the x-Kernel, or in systems such as Horus. We’ll be reading about these uses of hierarchy later in the textbook, and the ISO hierarchy remains popular as a simple but widely understood framework within which to discuss protocols.
General discussion of network architectures and the ISO hierarchy: [Tan88, Com91, CS91, CS93, ANSA91a, ANSA91b, ANSA89, CD90, CDK94, XTP95]. Pros and Cons of layered architectures: [CT87, RST88, RST89, Ous90, AP93, KP93, KC94, BD95]. Reliable stream communication: [Rit84, Jac88, Tan88, Com91, CS91, CS93, CDK94]. Failure Models and Classification: [Lam78b, Lam84, Ske82b, FLP85, ST87, CD90, Mar90, Cri91a, CT91, CHT92, GR93, SM94].
Historically, it has rarely been necessary to understand details of the hardware components from which a computing system was constructed if one merely wishes to develop software for it. The pressure to standardize operating systems, and the presentation of devices within them, created a situation in which it sufficed to understand the way that the operating system abstracted a device in order to use it correctly.
For example, there are a great many designs for computer disk storage units and the associated device controllers. Each design offers its own advantages and disadvantages when compared with the others, and any systems architect charged with selecting a data storage device would be wise to learn about the state of the art before making a decision. Yet, from a software perspective, device attributes are largely hidden. The developer normally considers a disk to be a device on which files can be stored, having various layout parameters that can be tuned to optimize I/O performance, and characterized by a set of speed and reliability properties. Developers of special classes of applications, such a multi-media image servers, may prefer to work with a less abstracted software interface to the hardware, exploiting otherwise hidden features at the cost of much greater software complexity. But for the normal user, one disk is much like any other.
To a considerable extent, the same is true for computer networking hardware. There are a number of major classes of communications devices, differing in speed, average access latency, maximum capacity (packets per second, bytes of data per second), support for special addressing modes, etc. However, since most operating systems implement the lowest layers of the OSI hierarchy as part of the device driver or communications abstraction of a system, applications can treat these devices interchangeably. Indeed, it can be quite difficult to determine just what the communications topology of a system actually is, because many operating systems lack services that would permit the user to query for this information.
In the remainder of this chapter, we review communication hardware in very superficial terms, giving just enough detail so that the reader should be familiar with technology names and properties, but without getting into the level of technical issues that would be important in designing the network topology for a demanding enterprise.
Throughout this chapter, the reader will notice that we use the term packet to refer to the type of messages that can be exchanged between communications devices. The distinction between a packet and a message, throughout this book, is that a message is a logical object generated by the application for transmission to one or more destinations. A message may be quite large, and can easily exceed the limits imposed by the operating system or the communications hardware. For transmission, messages are therefore fragmented into one or more packets, if necessary. A packet, then, is a hardware level message that often respects hardware-imposed size and format constraints, and may contain just a fragment of an application-level message.
Communications devices can be coarsely partitioned into functional classes:
Communications devices vary enormously in their properties, although this variability is often concealed by the layers of systems software through which applications operate. In simple terms, communications devices can be "rated" by a series of metrics:
At the time of this writing, ethernet is the most widely used communications technology for local-area networks (networks within a limited physical region, such as a single floor of a building). Bridged ethernets are the most common technology for networks within small enterprises, such as a large company at a single site.
As summarized earlier, the basic technology underlying an ethernet is a shared coaxial cable, on which signals are transmitted using a modulation technology similar to that of a radio. Packets have a fixed maximum size of 1400 bytes, but the size can be varied as long as this limit is not exceeded. In practice, software that runs over Ethernet will often be limited to approximately 1024 bytes of "payload" in each packet; the remaining 376 bytes are then available for headers, trailers, and data representation information. The ethernet itself, however, treats the entire object as data. The specific encoding used to represent packets will not be important to us here, but the basic idea is that each interface is structured into a sending side, and a listening side, and the latter is continuously active.
To receive a message, the listening side waits until it senses a packet header. The incoming packet is then copied to a memory buffer internal to the ethernet interface. To be accepted, a packet must have a valid checksum, and must specify an destination address that the interface has been preprogrammed to accept. Specifically, each interface has some number of programmable address registers, consisting of a "pattern mask" and a corresponding "value mask", each 32-bits in length. The pattern mask specifies bits within the destination address that must exactly match the corresponding bits of the value mask. A pattern mask that selects for all bits of the address will require an exact match between the packet and the value mask. A pattern mask that selects no bits will match every incoming packet – an interface with such an address loaded is said to be in promiscuous mode.
A received packet is copied into memory in the host computer, or discarded if no memory for an incoming packet is available. The host is then interrupted. Most ethernet interfaces permit the host to enqueue at least two memory regions for incoming messages, and some permit the host to chain a list of memory regions. Most also permit multiple (address,mask) pairs to be loaded into the interface, often as many as 64.
To send a packet, the ethernet interface waits for a pause between packets – a time when its listening side is idle. It then transmits the packet, but also listens to its own transmission. The idea is that if two ethernets both attempt to send at the same time, a collision will occur and the messages will overwrite one another, causing a noise burst. The receive logic will either fail immediately, or the checksum test will fail, since anything the interfaces read back in will be damaged by the collision, and the sending logic will recognize that a problem has occurred. The hardware implements an exponential back-off algorithm, in which the sending side delays for a randomly selected period of time within an interval that starts at a small value but doubles with each successive collision up to a maximum upper value. Although the probability of a collision on a first attempt to send can be high when the ethernet becomes loaded, exponential back-off has been shown to give very good average access behavior with excellent fairness properties. Because collisions are often detectable within a few bits after starting to send, ethernets lose little data to collisions even under heavy load, and the back-off algorithm can be shown to provide very uniform delays for access to the medium over very large numbers of senders and very large excess loads.
As a general rule, although a single interface can send multiple packets, a small amount of dead space will separate each packet in the stream, because a small amount of work by the operating system is normally needed before each successive packet can be transmitted, and because the ethernet hardware logic requires a small amount of time to compare the checksum on the echo of the outgoing packet, and to trigger an interrupt to the device driver, before starting to send a new packet. In contrast, when more than one interface is used to send data, sequences of "back to back" packets can be generated, potentially forcing the interface to accept several packets in a row with almost no delay between them. Obviously, this can result in packet loss if the chain of memory for incoming messages is exhausted. However, precisely because an ethernet is shared, the probability that any one interface will be the destination for any large number of back-to-back packets is low. File system servers and bridges, which are more likely to receive back-to-back packets, compensate for this by using long chains of buffers for incoming messages, and implementing very lightweight logic for dealing with received messages as quickly as possible.
An interesting feature of the ethernet is that it supports both broadcast and multicast in hardware. The two features are implemented in the same way. Before any communication is undertaken, the ethernet interface is preloaded with a special address – one that is the same on all machines within some set. Now, if a packet that contains this address is transmitted, all machines in that set will receive a copy, because all of their interfaces will detect a match.
To support broadcast, a special address is agreed upon and installed on all interfaces in the entire network, typically at the time the machine is first booted. Broadcast packets will now be received by every machine, offering a way to distribute the same data to a great many machines at very low cost. However, one should keep in mind that each interface individually computes the checksum, and hence that some interfaces may discard a packet as corrupted, while others accept it. Moreover, some machines may lack input buffers and hence may be incapable of accepting packets that are correctly received by the interface. Thus, ethernet broadcast can potentially send a single packet to all machines on a network, but in practice the technology is not a reliable one.
Multicast uses precisely the same approach, except that a subset of machines pick a group address that is unique within the network and install this into their interfaces. Messages destined to a multicast address will be accepted (up to checksum failures) by just these machines, and will be ignored by others. The maximum number of multicast addresses that an interface can support varies from vendor to vendor. As in the case of broadcast, hardware multicast is reasonably reliable but not absolutely so.
Earlier, we commented that even a very reliable communications device may exhibit modal behavior whereby reliability can be much poorer for certain communication patterns. One example of this problem is termed the broadcast- or multicast-storm, and arises when broadcast or multicast is used by multiple senders concurrently. In this situation, it becomes much more likely that a typical network interface will actually need to accept a series of back-to-back packets – with sufficiently many senders, chains of arbitrary length can be triggered. The problem is that in this situation the probability that two back to back messages are destined to the same machine, or set of machines, becomes much higher than for a more typical point-to-point communication load. As a result, the interface may run out of buffers for incoming packets and begin to drop them.
In a broadcast storm situation, packet loss rises to very high levels, because network interfaces become chronically short of memory for incoming packets. The signature of a broadcast storm is that heavy use of broadcast or multicast by multiple senders causes a dramatic increase in the packet loss rate throughout the network. Notice that the problem will affect any machine that has been programmed to accept incoming broadcasts, not just the machines that make meaningful use of the packets after they arise. Fortunately, the problem is uncommon if just a single sender is initiating the broadcasts, because a single sender will not generate back-to-back packets.
FDDI is a multi-access communication technology based upon a ring architecture, in which interfaces are interconnected by shielded "twisted pair" wiring. An interface plays a dual role:
• As a repeater, an FDDI interface receives messages from the interface to its left, accepts those messages that match an incoming address pattern (similar to ethernet), and then (accepted or not), forwards the message to the interface on the right. Forwarding occurs bit by bit or in small blocks of bits, so the delay associated with forwarding packets can be very low.
• As a transmitter, an FDDI interface waits until it has permission to initiate a packet, which occurs when there is no other packet being forwarded, and then sends its packet to the interface on its right. When the packet has completed its trip around the ring, the transmitter receives it from the interface to the left and deletes it. Status information is available in the packet trailer, and can be used to immediately retransmit a packet that some intended destination was unable to accept because of a shortfall of memory or because it detected a checksum error.
Finally, FDDI has a built-in fault-tolerance feature: if a link fails, FDDI will automatically reconfigure itself to route around it, as illustrated in
Figure 2-2. In contrast, a severed ethernet will either become inoperative, or will partition into two or more segments that are disconnected from one-another..
FDDI throughput (150Mbits/second) is about 15 times greater than standard 10-Mbit ethernet, although high speed 100-Mbit ethernet interfaces have recently been introduced that can approach FDDI performance. Latency for an FDDI ring, in particular, is poorer than that of an ethernet, because of the protocol used to wait for permission to send, and because of delays associated with forwarding the packet around the ring. In complex environments, ethernets or FDDI rings may be broken into segments, which are connected by some form of bridge or routing device; these environments have latency and throughput properties that are more difficult to quantify, because the topology of the interconnection path between a pair of computers can significantly impact the latency and throughput properties of the link.
B-ISDN is a standard introduced by telecommunications switching companies for advanced telephone services. ISDN stands for "integrated services digital network", and is based on the idea of supporting a communications system that mixes digital voice with other types of digital data, including video. The "B" stands for broadband and is a reference to the underlying data links, which operate at the extremely high speeds needed to handle these kinds of data.
Layered over B-ISDN, which is an infrastructure standard, is the emerging intelligent network, an elaborate software architecture for supporting next-generation telecommunications services. Such services have been slower to emerge than had originally been predicted, and many companies have become cautious in accessing intelligent network prospects for the near-term future. However, it may simply be the case that the intelligent network has been harder to deploy than was originally predicted: there are many signs that the long-predicted revolution in telecommunications is really happening today.
The B-ISDN architecture is elaborate, and the intelligent network is even more so. Our focus on reliability of general distributed computing systems prevents us from discussing either in any detail here. However, one interesting feature of B-ISDN merits mention. Although destination addresses in ISDN are basically telephone numbers, ISDN interprets these in a very flexible manner that permits the architecture to do far more than just creating connections between telephones and other paired devices. Instead, the ISDN architecture revolves around the notion of an intelligent switching system: each packet follows a route from the source through a series of switches to its destination. When this route is first established, each switch maps the destination telephone number to an appropriate outgoing connection (to another switch), to a local point of data delivery (if this switch serves the destination), or to a service. The decision is made by looking up the telephone number in a database, using a software procedure that can be reprogrammed to support very elaborate behaviors.
For example, suppose that a telephone company wanted to offer a special communications service to computer vendors in some metropolitan area. This service would offer a single telephone number to each vendor and would arrange to automatically route a call to the mobile telephone of the computer repair person physically closest to the caller.
To implement this service using B-ISDN, the telephone company would make use of a database it already maintains, giving the locations of mobile telephone units. As a call is routed into a switch, the company would sense that the destination is one of the special telephone numbers assigned to this new service, and would invoke a database search algorithm programmed to lookup the physical address of the caller, and then match this with the locations of service vehicles for the called organization to pick the closest one, routing the call to the computer vendor’s switchboard if the lookup fails. Although timing constraints for this process are demanding (actions are generally required within a small fraction of a second), modern computers are becoming fast enough to work within these sorts of deadlines. And, one can imagine a great number of other B-ISDN services, including message centers, automatic playback of pre-recorded information, telephone numbers that automatically patch into various types of public or private information bases, and so forth. The potential is huge.
B-ISDN is also illustrative of how advances in telecommunications switching technology are creating new demands for reliable distributed software services. It is common to require that telephone systems maintain extremely high levels of reliability – a typical requirement is that not more than one call in 100,000 be dropped, and downtime for an entire switch may be limited to seconds per year or less – switches are increasingly used to support critical services such as 911 emergency numbers and communication between air traffic controllers and police vehicles. caller.
Reliability of this sort has many implications for developers of advanced switching systems. The switches themselves must be paired, and protocols for doing so have been standardized as part of an architecture called Signalling System 7 (SS7), which is gradually entering into world-wide use. The co-processors on which intelligent services reside are often constructed using fault-tolerant computing hardware. The software that implements the switching logic must be self-managing, fault-tolerant, and capable of supporting on-line upgrades to new versions of applications and of the operating system itself. And, because many services require some form of distributed database, such as the database of location information that arose in the telephone dispatch example above, sets of coprocessors will often need to be organized into distributed systems that manage dynamically changing replicated data and take actions in a consistent but decentralized manner. For example, routing a call may require independent routing decisions by the service programs associated with several switches, and these decisions need to be based upon consistent data or the call will eventually be dropped, or will be handled incorrectly.
B-ISDN, then, and the intelligent network that it is intended to support, represent good examples of settings where the technology of reliable distributed computing is required, and will have a major impact on society as a whole. Given solutions to reliable distributed computing problems, a vast array of useful telecommunication services will become available starting in the near future and continuing over the decades to come. One can imagine a telecommunications infrastructure that is nearly ubiquitous and elegantly integrated into the environment, providing information and services to users without the constraints of telephones that are physically wired to the wall and computer terminals or televisions that weigh many pounds and are physically attached to a company’s network. But the dark side of this vision is that without adequate attention to reliability and security, this exciting new world will also be erratic and failure-prone.
Asynchronous Transfer Mode, or ATM, is an emerging technology for routing small digital packets in telecommunications networks. When used at high speeds, ATM networking is the "broadband" layer underlying B-ISDN; thus, an article describing a B-ISDN service is quite likely to be talking about an application running on an ATM network that is designed using the B-ISDN architecture.
ATM technology is considered especially exciting both because of its extremely high bandwidth and low latencies, and because this connection to B-ISDN represents a form of direct covergence between the telecommunications infrastructure and the computer communications infrastructure. With ATM, for the first time, computers are able to communicate directly over the data transport protocols used by the telephone companies. Over time, ATM networks will be more and more integrated with the telephone system, offering the possibility of new kinds of telecommunications applications that can draw immediately upon the world-wide telephone network. Moreover, ATM opens the door for technology migration from those who develop software for computer networks and distributed systems into the telecommunications infrastructure and environment.
The packet switches and computer interfaces needed in support of ATM standards are being deployed rapidly in industry and research settings, with performance expected to scale from rates comparable to those of a fast ethernet for first-generation switches to gigabit rates in the late 1990’s and beyond. ATM is defined as a routing protocol for very small packets, containing 48 bytes of payload data with a 5-byte header. These packets traverse routes that must be pre-negotiated between the sender, destination, and the switching network. The small size of the ATM packets leads some readers to assume that ATM is not really "about" networking in the same sense as an ethernet, with its 1400-byte packets. In fact, however, the application programmer normally would not need to know that messages are being fragmented into such a small size, tending instead to think of ATM in terms of its speed and low latency. Indeed, at the highest speeds, ATM cells can be thought of almost as if they were fat bits, or single words of data being transferred over a backplane.
ATM typically operates over point-to-point fiber-optic cables, which route through switches. Thus, a typical ATM installation might resemble the one shown in Figure 2-4. Notice that in this figure, some devices are connected directly to the ATM network itself and not handled by any intermediary processors. The rationale for such an architecture is that ATM devices may eventually run at such high data rates (today, an "OC3" ATM network operates at 155Mbits/second (Mbps), and future "OC24" networks will run at a staggering 1.2Gbps) that any type of software intervention on the path between the data source and the data sink would be out of the question. In such environments, application programs will more and more be relegated to a supervisory and control role, setting up the links and turning the devices on and off, but not accessing the data flowing through the network in a direct way. Not shown are adaptors that might be used to interface an ATM directly to an ethernet or some other local area technology, but these are also available on the market today and will play a big role in many furture ATM installations. These devices allow an ATM network to be attached to an ethernet, token ring, or FDDI network, with seamless communication through the various technologies. They should be common by late in the 1990’s.
The ATM header consists of a VCI (2 bytes, giving the virtual circuit id), a VPI (1 byte giving the virtual path id), a flow-control data field for use in software, a packet type bit (normally used to distinguish the first cell of a multi-cell transmission from the subordinate ones, for reasons that will become clear momentarily), a cell "loss priority" field, and a 1-byte error-checking field that typically contains a checksum for the header data. Of these, the VCI and the packet type (PTI) bit are the most heavily used, and the ones we discuss further below. The VPI is intended for use when a number of virtual circuits connect the same source and destination; it permits the switch to multiplex such connections in a manner that consumes less resources than if the VCI’s were used directly for this purpose. However, most current ATM networks set this field to 0, and hence we will not discuss it further here.
There are three stages to creating and using an ATM connection. First, the process initiating the connection must construct a "route" from its local switch to the destination. Such a route consists of a path of link addresses. For example, suppose that each ATM switch is able to accept up to 8 incoming links and 8 outgoing links. The outgoing links can be numbered 0-7, and a path from any data source to any data sink can then be expressed as a series of 3-bit numbers, indexing each successive hop that the path will take. Thus, a path written as 184.108.40.206.1.4 might describe a route through a series of 6 ATM switches. Having constructed this path, a virtual circuit identifier is created and the ATM network is asked to "open" a circuit with that identifier and path. The ATM switches, one by one, add the identifier to a table of open identifiers and record the corresponding out-link to use for subsequent traffic. If a bidirectional link is desired, the same path can be set up to operate in both directions. The method generalizes to also include multicast and broadcast paths. The VCI, then, is the virtual circuit identifier used during the open operation.
Having described this, however, it should be stressed that many early ATM applications depend upon what are called "permanent virtual channels", namely virtual channels that are preconfigured by a systems administrator at the time the ATM is installed, and changed rarely (if ever) thereafter. Although it is widely predictated that dynamically created channels will eventually dominate the use of ATM, it may turn out that the complexity of opening channels and of ensuring that they are closed correctly when an endpoint terminates its computation or fails will emerge as some form of obstacle that presents this step from occuring.
In the second stage, the application program can send data over the link. Each outgoing message is fragmented, by the ATM interface controller, into a series of ATM packets or "cells". These cells are prefixed with the circuit identifier that is being used (which is checked for security purposes), and the cells then flow through the switching system to their destination. Most ATM devices will discard cells in a random manner if a switch becomes overloaded, but there is a great deal of research underway on ATM scheduling and a variety of so-called quality of service options will become available over time. These might include guarantees of minimum bandwidth, priority for some circuits over others, or limits on the rate at which cells will be dropped. Fields such as the packet type field and the cell loss priority field are intended for use in this process.
It should be noted, however, that just as many early ATM installations use permanent virtual circuits instead of supporting dynamically created circuits, many also treat the ATM as an ethernet emulator, and employ a fixed bandwidth allocation corresponding roughly to what an ethernet might offer. It is possible to adopt this approach because ATM switches can be placed into an emulation mode in which they support broadcast, and early ATM software systems have taken advantage of this to layer the TCP/IP protocols over ATM much as they are built over an ethernet. However, fixed bandwidth allocation is inefficient, and treating an ATM as if it were an ethernet somewhat misses the point! Looking to the future, most reseachers expect this emulation style of network to gradually give way to direct use of the ATM itself, which can support packet-switched multicast and other types of communication services. Over time, "value-added switching" is also likely to emerge as an important area of competition between vendors; for example, one can easily imagine incorporating encryption and filtering directly into ATM switches and in this way offering what are called virtual private network services to users (Chapters 17 and 19).
The third stage of ATM connection management is concerned with closing a circuit and freeing dynamically associated resources (mainly, table entries in the switches). This occurs when the circuit is no longer needed. ATM systems that emulate IP networks or that use permanent virtual circuits are able to skip this final stage, leaving a single set of connections continuously open, and perhaps dedicating some part of the aggregate bandwidth of the switch to each such connection. As we evolve to more direct use of ATM, one of the reliability issues that may arise will be that of detecting failures so that any ATM circuits opened by a process that later crashed will be safely and automatically closed on its behalf. Protection of the switching network against applications that erroneously (or maliciously) attempt to monopolize resources by opening a great many virtual circuits will also need to be addressed in future systems.
ATM poses some challenging software issues. Communication at gigabit rates will require substantial architectural evolution and may not be feasible over standard OSI-style protocol stacks, because of the many layers of software and protocols that messages typically traverse in these architectures. As noted above, ATM seems likely to require that video servers and disk data servers be connected directly to the "wire", because the overhead and latency associated with fetching data into a processor’s memory before transmitting it can seem very large at the extremes of performance for which ATM is intended. These factors make it likely that although ATM will be usable in support of networks of high performance workstations, the technology will really take off in settings that exploit novel computing devices and new types of software architectures. These issues are already stimulating rexamination of some of the most basic operating system structures, and when we look at high speed communication in Chapter 8, many of the technologies considered turn out to have arisen as responses to this challenge.
Even layering the basic Internet protocols over ATM has turned out to be non-trivial. Although it is easy to fragment an IP packet into ATM cells, and the emulation mode mentioned above makes it straightforward to emulate IP networking over ATM networks, traditional IP software will drop an entire IP packet if any part of the data within it is corrupted. An ATM network that drops even a single cell per IP packet would thus seem to have 0% reliability, even though close to 99% of the data might be getting through reliably. This consideration has motivated ATM vendors to extend their hardware and software to understand IP and to arrange to drop all of an IP packet if even a single cell of that packet must be dropped, an example of a simple quality-of-service property. The result is that as the ATM network becomes loaded and starts to shed load, it does so by beginning to drop entire IP packets, hopefully with the result that other IP packets will get through unscathed. This leads us to the use of the packet type identifier bit: the idea is that in a burst of packets, the first packet can be identified by setting this bit to 0, and subsequent "subordinate" packets identified by setting it to 1. If the ATM must drop a cell, it can then drop all subsequent cells with the same VCI until one is encountered with the PTI bit set to 0, on the theory that all of these cells will be discarded in any case upon reception, because of the prior lost cell.
Looking to the future, it should not be long before IP drivers or special ATM firmware is developed that can buffer outgoing IP packets briefly in the controller of the sender and selectively solicit retransmission of just the missing cells if the receiving controller notices that data is missing. One can also imagine protocols whereby the sending ATM controller might compute and periodically transmit a parity cell containing the exclusive-or of all the prior cells for an IP packet; such a parity cell could then be used to reconstruct a single missing cell on the receiving side. Quality of service options for video data transmission using MPEG or JPEG may soon be introduced. Although these suggestions may sound complex and costly, keep in mind that the end-to-end latencies of a typical ATM network are so small (tens of microseconds) that it is entirely feasible to solicit the retransmission of a cell or two this even as the data for the remainder of the packet flows through the network. With effort, such steps should eventually lead to very reliable IP networking at ATM speeds. But the non-trivial aspects of this problem also point to the general difficulty of what, at first glance, might have seemed to be a completely obvious step to take. This is a pattern that we will often encounter throughout the remainder of the book!
Parallel supercomputer architectures, and their inexpensive and smaller-scale cousins, the cluster computer systems, have a natural correspondence to distributed systems. Increasingly, all three classes of systems are structured as collections of processors connected by high speed communications buses and with message passing as the basic abstraction. In the case of cluster computing systems, these communications buses are often based upon standard technologies such as fast ethernet or packet switching similar to that used in ATM. However, there are significant differences too, both in terms of scale and properties. These considerations make it necessary to treat cluster and parallel computing as a special case of distributed computing for which a number of optimizations are possible, and where special considerations are also needed in terms of the expected nature of application programs and their goals vis-a-vis the platform.
In particular, cluster and parallel computing systems often have built-in management networks that make it possible to detect failures extremely rapidly, and may have special purpose communication architectures with extremely regular and predictable performance and reliability properties. The ability to exploit these features in a software system creates the possibility that developers will be able to base their work on the general-purpose mechanisms used in general distributed computing systems, but to optimize them in ways that might greatly enhance their reliability or performance. For example, we will see that the inability to accurately sense failures is one of the hardest problems to overcome in distributed systems: certain types of network failures can create conditions indistinguishable from processor failure, and yet may heal themselves after a brief period of disruption, leaving the processor healthy and able to communicate again as if it had never been gone. Such problems do not arise in a cluster or parallel architecture, where accurate failure detection can be "wired" to available hardware features of the communications interconnect.
In this textbook, we will not consider cluster or parallel systems until Chapter 24, at which time we will ask how the special properties of such systems impacts the algorithmic and protocol issues that we consider in the previous chapters. Although there are some important software systems for parallel computing (PVM is the best known [GDBJ94]; MPI may eventually displace it [MPI96]), these are not particularly focused on reliability issues, and hence will be viewed as being beyond the scope of the current treatment.
Few areas of technology development are as active as that involving basic communication technologies. The coming decade should see the introduction of powerful wireless communication technologies for the office, permitting workers to move computers and computing devices around a small space without the rewiring that contemporary devices often require. Bandwidth delivered to the end-user can be expected to continue to rise, although this will also require substantial changes in the software and hardware architecture of computing devices, which currently limits the achievable bandwidth for traditional network architectures. The emergence of exotic computing devices targetted to single applications should begin to displace general computing systems from some of these very demanding settings.
Looking to the broader internet, as speeds are rising, so too is congestion and contention for network resources. It is likely that virtual private networks, supported through a mixture of software and hardware, will soon become available to organizations able to pay for dedicated bandwidth and guaranteed latency. Such networks will need to combine strong security properties with new functionality, such as conferencing and multicast support. Over time, it can be expected that these data oriented networks will merge into the telecommunications "intelligent network" architecture, which provides support for voice, video and other forms of media, and mobility. All of these features will present the distributed application developer with new options, as well as new reliability challenges.
Reliability of the telecommunications architecture is already a concern, and that concern will only grow as the public begins to insist on stronger guarantees of security and privacy. Today, the rush to deploy new services and to demonstrate new communications capabilities has somewhat overshadowed robustness issues of these sorts. One consequence, however, has been a rash of dramatic failures and attacks on distributed applications and systems. Shortly after work on this book began, a telephone "phreak" was arrested for reprogramming the telecommunications switch in his home city in ways that gave him nearly complete control over the system, from the inside. He was found to have used his control to misappropriate funds through electronic transfers, and the case is apparently not an isolated event. Meanwhile, new services such as "caller id" have turned out to have unexpected side-effects, such as permitting companies to build databases of the telephone numbers of the individuals who contact them. Not all of these individuals would have agreed to divulge their numbers.
Such events, understandably, have drawn considerable public attention and protest. As a consequence, they contribute towards a mindset in which the reliability implications of technology decisions are being given greater attention. Such the trend continue, it could eventually lead to wider use of technologies that promote distributed computing reliability, security and privacy over the coming decades.
Addtional discussion of the topics covered in this chapter can be found in [Tan88, Com91, CS91, CS93,CDK94]. An outstanding treatment of ATM is [HHS94].
A communications standard is a collection of specifications governing the types of messages that can be sent in a system, the formats of message headers and trailers, the encoding rules for placing data into messages, and the rules governing format and use of source and destination addresses. In addition to this, a standard will normally specify a number of protocols that a provider should implement.
Examples of communications standards that are used widely, although not universally so, are:
During the 1990’s, the emergence of "open systems", namely systems in which computers from different vendors and running independently developed software, has been an important trend. Open systems favor standards, but also must support current practice, since vendors otherwise find it hard to move their customer base to the standard. At the time of this writing, the trend clearly favors the Internet protocol suite as the most widely supported communications standard, with the Novell protocols strongly represented by force of market share. However, there protocol suites were designed long before the advent of modern high speed communications devices, and the commercial pressure to develop and deploy new kinds of distributed applications that exploit gigabit networks could force a rethinking of these standards. Indeed, even as the Internet has become a "de facto" standard, it has turned out to have serious scaling problems that may not be easy to fix in less than a few years (see Figure 3-1).
The remainder of this chapter focuses on the Internet protocol suite because this is the one used by the Web. Details of how the suite is implemented can be found in [Com91,CS91,CS93].
The addressing tools in a distributed communication system provide unique identification for the source and destination of a message, together with ways of mapping from symbolic names for resources and services to the corresponding network address, and for obtaining the best route to use for sending messages.
Addressing is normally standardized as part of the general communication specifications for formatting data in messages, defining message headers, and communicating in a distributed environment.
Within the Internet, several address formats are available, organized into "classes" aimed at different styles of application. Each class of address is represented as a 32-bit number. Class A internet addresses have a 7-bit network identifier and a 24-bit host-identifier, and are reserved for very large networks. Class B addresses have 14 bits for the network identifier and 16 bits for the host-id, and class C has 21 bits of network identifier and 8 bits for the host-id. These last two classes are the most commonly used. Eventually, the space of internet addresses is likely to be exhausted, at which time a transition to an extended IP address is planned; the extended format increases the size of addresses to 64 bits but does so in a manner that provides backwards compatibility with existing 32-bit addresses. However, there are many hard problems raised by such a transition and industry is clearly hesitant to embark on what will be a hugely disruptive process.
Internet addresses have a standard ASCII representation, in which the bytes of the address are printed as signed decimal numbers in a standardized order. For example, this book was edited on host gunnlod.cs.cornell.edu, which has internet address 220.127.116.11. This is a class B internet address, with network address 42 and host-id 218.58. Network address 42 is assigned to Cornell University, as one of several class B addresses used by the University. The 218.xxx addresses designate a segment of Cornell’s internal network, namely the ethernet to which my computer is attached. The number 58 was assigned within the Computer Science Department to identify my host on this ethernet segment.
A class D internet address is intended for special uses: IP multicasting. These addresses are allocated for use by applications that exploit IP multicast. Participants in the application join the multicast group, and the internet routing protocols automatically reconfigure themselves to route messages to all group members.
The string "gunnlod.cs.cornell.edu" is a symbolic name for IP address. The name consists of a machine name (gunnlod, an obscure hero of Norse mythology) and a suffix (cs.cornell.edu) designating the Computer Science Department at Cornell University, which is an educational institution in the United States. The suffix is registered with a distributed service called the domain name service, or DNS, which supports a simple protocol for mapping from string names to IP network addresses.
Here’s the mechanism used by the DNS when it is asked to map my host name to the appropriate IP address for my machine. DNS has a top-level entry for "edu" but doesn’t have an Internet address for this entry. However, DNS resolves cornell.edu to a gateway address for the Cornell domain, namely host 18.104.22.168. Finally, DNS has an even more precise address stored for cs.cornell.edu, namely 22.214.171.124 – a mail server and gateway machine in the Computer Science Department. All messages to machines in the Computer Science Department pass through this machine, which intercepts and discards messages to all but a select set of application programs.
DNS is itself structured as a hierarchical database of slowly changing information. It is hierarchical in the sense that DNS servers form a tree, with each level providing addresses of objects in the level below it, but also caching remote entries that are frequently used by local processes. Each DNS entry tells how to map some form of ascii hostname to the corresponding IP machine address or, in the case of commonly used services, how to find the service representative for a given host name.
Thus, DNS has an entry for the IP address of gunnlod.cs.cornell.edu (somewhere), and can track it down using its resolution protocol. If the name is used rapidly, the information may become cached local to the typical users and will resolve quickly; otherwise the protocol sends the request up the hierarchy to a level at which DNS knows how to resolve some part of the name, and then back down the hierarchy to a level that can fully resolve it. Similarly, DNS has a record telling how to find a mail transfer agent running the SMTP protocol for gunnlod.cs.cornell.edu: this may not be the same machine as gunnlod itself, but the resolution protocol is the same.
Internet Brownouts: Power Failures on the Data Superhighway?
Begining in late 1995, clear signs emerged that the Internet was beginning to overload. One reason is that the "root" servers for the DNS architecture are experiencing exponential growth in the load of DNS queries that require action by the top levels of the DNS hierarchy. A server that saw 10 queries per minute in 1993 was up to 250 queries per second in early 1995, and traffic was doubling every three months. Such problems point to fundamental aspects of the Internet that were based on assumptions of a fairly small and lightly loaded user population that repeatedly performed the same sorts of operations. In this small world, it makes sense to use a single hierarchical DNS structure with caching, because cache hits were possible for most data. In a network that suddenly has millions of users, and that will eventually support billions of users, such design considerations must be reconsidered: only a completely decentralized architecture can possibly scale to support a truely universal and world-wide service.
These problems have visible but subtle impact on the internet user: they typically cause connections to break, or alert boxes to appear on your Web browser warning you that the host possessing some resource is "unavailable." There is no obvious way to recognize that the problem is not one of local overload or congestion, but in fact is an overloaded DNS server or one that has crashed at a major Internet routing point. Unfortunately, such problems have become increasingly common: the Internet is starting to experience brownouts. Indeed, the Internet became largely unavailable because of failures of this nature for many hours during one crash in September of 1995, and this was hardly an unusual event. As the data superhighway becomes increasingly critical, such brownouts represent increasingly serious threats to reliability.
Conventional wisdom has it that the Internet does not follow the laws of physics, there is no limit to how big, fast and dense the Internet can become. Like the hardware itself, which seems outmoded almost before it reaches the market, we assume that the technology of the network is also speeding up in ways that outrace demand. But the reality of the situation is that the software architecture of the Internet is in some basic ways not scalable. Short of redesigning these protocols, the Internet won’t keep up with growing demands. In some ways, it already can’t.
Several problems are identified as the most serious culprits at the time of this writing. Number one in any ranking: the World Wide Web. The Web has taken over by storm, but it is inefficient in the way it fetches documents. In particular, as we will see in Chapter 10, the HTTP protocol often requires that large numbers of connections be created for typical document transfers, and these connections (even for a single HTML document) can involve contacting many separate servers. Potentially, each of these connection requests forces the root nodes of the DNS to respond to a query. With millions of users "surfing the network", DNS load is skyrocketing.
Bandwidth requirements are also growing exponentially. Unfortunately, the communication technology of the Internet is scaling more slowly than this. So overloaded connections, particularly near "hot sites", are a tremendous problem. A popular Web site may receive hundreds of requests per second, and each request must be handled separately. Even if the identical bits are being transmitted concurrently to hundreds of users, each user is sent its own, private copy. And this limitation means that as soon as a server becomes useful or interesting, it also becomes vastly overloaded. Yet ven though identical bits are being sent to hundreds of thousands of destinations, the protocols offer no obvious way to somehow multicast the desired data, in part because Web browsers explicitly make a separate connection for each object fetched, and only specify the object to send after the connection is in place. At the time of this writing, the best hope is that popular documents can be cached with increasing efficiency in "web proxies", but as we will see, doing so also introduces tricky issues of reliability and consistency. Meanwhile, the bandwidth issue is with us to stay.
Internet routing is another area that hasn’t scaled very well. In the early days of the Internet, routing was a major area of research, and innovative protocols were used to route around areas of congestion. But these protocols were eventually found to be consuming too much bandwidth and imposing considerable overhead: early in the 1980’s, 30% of Internet packets were associated with routing and load-balancing. A new generation of relatively static routing protocols was proposed at that time, and remain in use today. But the assumptions underlying these "new" reflected a network that, at the time, seemed "large" because it contained hundreds of nodes. A network of tens of millions or billions of nodes poses problems that could never have been anticipated in 1985. Now that we have such a network, even trying to understand its behavior is a major challenge. Meanwhile, when routers fail (for reasons of hardware, software, or simply because of overload), the network is tremendously disrupted.
The Internet Engineering Task Force (IETF), a governing body for the Internet and for Web protocols, is working on this problems. This organization sets the standards for the network and has the ability to legislate solutions. A variety of proposals are being considered: they include ways of optimizing the Web protocol called HTTP, and other protocol optimizations.
Some service providers are urging the introduction of mechanisms that would charge users based on the amount of data they transfer and thus discourage overuse (but one can immediately imagine the parents of an enthusiastic 12-year old forced to sell their house to pay the monthly network bill). There is considerable skepticism that such measures are practical. Bill Gates has suggested that in this new world, one can easily charge for the "size of the on-ramp" (the bandwidth of one’s connection), but not for the amount of information a user transfers, and early evidence supports his perspective. In Gate’s view, this is simply a challenge of the new Internet market.
There is no clear solution to the Internet bandwidth problem. However, as we will see in the textbook, there are some very powerful technologies that could begin to offer answers: coherent replication and caching being the most obvious remedy for many of the problems cited above. The financial motivations for being first to market with the solution are staggering, and history shows that this is a strong incentive indeed.
Figure -1: The data superhighway is experiencing serious growing pains. Growth in load has vastly exceeded the capacity of the protocols used in the Internet and World-Wide-Web. Issues of consistency, reliability, and availability in technologies such as the ones that support these applications are at the core of this textbook.
The internet address specifies a machine, but the identification of the specific application program that will process the message is also important. For this purpose, internet addresses contain a field called the port number, which is at present a 16-bit integer. A program that wishes to receive messages must bind itself to a port number on the machine to which the messages will be sent. A predefined list of port numbers is used by standard system services, and have values in the range 0-1023. Symbolic names have been assigned to many of these predefined port numbers, and a table mapping from names to port numbers is generally provided.
For example, messages sent to gunnlod.cs.cornell.edu that specify port 53 will be delivered to the DNS server running on machine gunnlod, or discarded if the server isn’t running. Email is sent using a subsystem called SMTP, on port-number 25. Of course, if the appropriate service program isn’t running, messages to a port will be silently discarded. Small port numbers are reserved for special services and are often "trusted", in the sense that it is assumed that only a legitimate SMTP agent will ever be connected to port 25 on a machine. This form of trust depends upon the operating system, which decides whether or not a program should be allowed to bind itself to a requested port.
Port numbers larger than 1024 are available for application programs. A program can request a specific port, or allow the operating system to pick one randomly. Given a port number, a program can register itself with the local network information service (NIS) program, giving a symbolic name for itself and the port number that it is listening on. Or, it can send its port number to some other program, for example by requesting a service and specifying the internet address and port number to which replies should be transmitted.
The randomness of port selection is, perhaps unexpectedly, an important source of security in many modern protocols. These protocols are poorly protected against intruders, who could attack the application if they were able to guess the port numbers being used. By virtue of picking port numbers randomly, the protocol assumes that the barrier against attack has been raised substantially, and hence that it need only protect against accidental delivery of packets from other sources: presumably an infrequent event, and one that is unlikely to involve packets that could be confused with the ones legitimately used by the protocol on the port. Later, however, we will see that such assumptions may not always be safe: modern network hackers may be able to steal port numbers out of IP packets; indeed, this has become a serious enough problem so that proposals for encrypting packet headers are being considered by the IETF.
Not all machines have identical byte orderings. For this reason, the internet protocol suite specifies a standard byte order that must be used to represent addresses and port numbers. On a host that does not use the same byte order as the standard requires, it is important to byte-swap these values before sending a message, or after receiving one. Many programming languages include communication libraries with standard functions for this purpose.
Finally, notice that the network services information specifies a protocol to use when communicating with a service – TCP, when communicating with the uucp service, UDP when communication with the tftp service (a file transfer program), and so forth. Some services support multiple options, such as the domain name service. As we discussed earlier, these names refer to protocols in the internet protocol suite.
This section presents the three major components of the internet protocol suite: the IP protocol, on which the others are based, and the TCP and UDP protocols, which are the ones normally employed by applications. We also discuss some recent extentions to the IP protocol layer in support of IP multicast protocols. There has been considerable discussion of security for the IP layer, but no single proposal has gained wide acceptance as of the time of this writing, and we will say very little about this ongoing work for reasons of brevity.
The lowest layer of the internet protocol suite is a connectionless packet transport protocol called IP. IP is responsible for unreliable transport of variable size packets (but with a fixed maximum size, normally 1400 bytes), from the sender’s machine to the destination machine. IP packets are required to conform to a fixed format consisting of a variable-length packet header, a variable-length body, and an optional trailer. The actual lengths of the header, body, and trailer are specified through length fields that are located at fixed offsets into the header. An application that makes direct use of IP is expected to format its packets according to this standard. However, direct use of IP is normally restricted because of security issues raised by the prospect of applications that might exploit such a feature to "mimic" some standard protocol, such as TCP, but do so in a non-standard way that could disrupt remote machines or create security loopholes.
Implementations of IP normally provide routing functionality, using either a static or dynamic routing architecture. The type of routing used will depend upon the complexity of the installation and its configuration of of the internet software, and is a topic beyond the scope of this textbook.
In 1995, IP was enhanced to provide a security architecture whereby packet payloads can be encrypted to prevent intruders from determining packet contents, and providing options for signatures or other authentication data in the packet trailer. Encryption of the packet header is also possible within this standard, although use of this feature is possible only if the routing layers and IP software implementation on all machines in the network agree upon the encryption method to use.
TCP is a name for the connection-oriented protocol within the internet protocol suite. TCP users start by making a TCP connection, which is done by having one program set itself up to listen for and accept incoming connections, while the other connects to it. A TCP connection guarantees that data will be delivered in the order sent, without loss or duplication, and will report an "end of file" if the process at either end exits or closes the channel. TCP connections are byte-stream oriented: although the sending program can send blocks of bytes, the underlying communication model views this communication as a continuous sequence of bytes. TCP is thus permitted to lose the boundary information between messages, so that what is logically a single message may be delivered in several smaller chunks, or delivered together with fragments of a previous or subsequent message (always preserving the byte ordering, however!). If very small messages are transmitted, TCP will delay them slightly to attempt to fill larger packets for efficient transmission; the user must disable this behavior if immediate transmission is desired.
Applications that involve concurrent use of a TCP connection must interlock against the possibility that multiple write operations will be done simultaneously on the same channel; if this occurs, then data from different writers can be interleaved when the channel becomes full.
UDP is a message or "datagram" oriented protocol. With this protocol, the application sends messages which are preserved in the form sent and delivered intact, or not at all, to the destination. No connection is needed, and there are no guarantees that the message will get through, or that messages will be delivered in any particular order, or even that duplicates will not arise. UDP imposes a size limit of 8k bytes on each message: an application needing to send a large message must fragment it into 8k chunks.
Internally, UDP will normally fragment a message into smaller pieces, which correspond to the maximum sizeof an IP packet, and matches closely with the maximum size packet that an ethernet can transmit in a single hardware packet. If a UDP packet exceeds the maximum IP packet size, the UDP packet is sent as a series of smaller IP packets. On reception, these are reassembled into a larger packet. If any fragment is lost, the UDP packet will eventually be discarded.
The reader may wonder why this sort of two-level fragmentation scheme is used – why not simply limit UDP to 1400 bytes, too? To understand this design, it is helpful to start with a measurement of the cost associated with a communication system call. On a typical operating system, such an operation has a minimum overhead of 20- to 50-thousand instructions, regardless of the size of the data object to be transmitted. The idea, then, is to avoid repeatedly traversing long code paths within the operating system. When an 8k-byte UDP packet is transmitted, the code to fragment it into smaller chunks executes "deep" within the operating system. This can save 10’s of thousands of instructions.
One might also wonder why communication needs to be so expensive, in the first place. In fact, this is a very interesting and rather current topic, particularly in light of recent work that has reduced the cost of sending a message (on some platforms) to as little as 6 instructions. In this approach, which is called Active Messages [ECGS92, EBBV95], the operating system is kept completely off the message path, and if one is willing to paya slightly higher price, a similar benefit is possible even in a more standard communications architecture (see Section 8.3). Looking to the future, it is entirely plausible to believe that commercial operating systems products offering comparably low latency and high throughput will start to be available in the late 1990’s. However, the average operating system will certainly not catch up with the leading edge approaches for many years. Thus, applications may have to continue to live with huge and in fact unecessary overheads for the time being.
IP multicast is a relatively recent addition to the Internet protocol suite [Der88,Der89,DC90]. With IP multicast, UDP or IP messages can be transmitted to groups of destinations, as opposed to a single point to point destination. The approach extends the multicast capabilities of the ethernet interface to work even in complex networks with routing and bridges between ethernet segments.
IP multicast is a session-oriented protocol: some work is required before communication can begin. The processes that will communicate must create an IP multicast address, which is a class-D Internet address containing a multicast identifier in the lower 28 bits. These processes must also agree upon a single port number that all will use for the communication session. As each process starts, it installs IP address into its local system, using system calls that place the IP multicast address on the ethernet interface(s) to which the machine is connected. The routing tables used by IP, discussed in more detail below, are also updated to ensure that IP multicast packets will be forwarded to each destination and network on which group members are found.
Once this setup has been done, an IP multicast is initiated by simply sending a UDP packet with the IP multicast group address and port number in it. As this packet reaches a machine which is included in the destination list, a copy is made and delivered to local applications receiving on the port. If several are bound to the same port on the same machine, a copy is made for each.
Like UDP, IP multicast is an unreliable protocol: packets can be lost, duplicated or delivered out of order, and not all members of a group will see the same pattern of loss and delivery. Thus, although one can build reliable communication protocols over IP multicast, the protocol itself is inherently unreliable.
When used through the UDP interface, a UDP multicast facility is similar to a UDP datagram facility, in that each packet can be as long as the maximum size of UDP transmissions, which is typically 8k. However, when sending an IP or UDP multicast, it is important to remember that the reliability observed may vary from destination to destination. One machine may receive a packet that others drop because of memory limitations or corruption caused by a weak signal on the communications medium, and the loss of even a single fragment of a large UDP message will cause the entire message to be dropped. Thus, one talks more commonly about IP multicast than UDP multicast, and it is uncommon for applications to send very large messages using the UDP interface. Any application that uses this transport protocol should carefully instrument loss rates, because the effective performance for small messages may actually be better than for large ones due to this limitation.
Routing is the method by which a communication system computes the path by which packets will travel from source to destination. A routed packet is said to take a series of hops, as it is passed from machine to machine. The algorithm used is generally as follows:
A number of methods have been developed for maintaining routing tables. The most common approach is to use static routing. In this approach, the routing table is maintained by system administrators, and is never modified while the system is active.
Dynamic routing is a class of protocols by which machines can adjust their routing tables to benefit from load changes, route around congestion and broken links, reconfigure to exploit links that have recovered from failures. In the most common approaches, machines periodically distribute their routing tables to nearest neighbors, or periodically broadcast their routing tables within the network as a whole. For this latter case, a special address is used that causes the packet to be routed down every possible interface in the network; a hop-count limit prevents such a packet from bouncing endlessly.
The introduction of IP multicast has resulted in a new class of routers that are static for most purposes, but that maintain special dynamic routing policies for use when an IP multicast group spans several segments of a routed local area network. In very large settings, this multicast routing daemon can take advantage of the multicast backbone or mbone network to provide group communication or conferencing support to sets of participants working at physically remote locations. However, most use of IP multicast is limited to local area networks at the time of this writing, and wide-area multicast remains a somewhat speculative research topic.
The reader may be curious about the following issue. The architecture described above permits packets to be lost at each hop in the communication subsystem. If a packet takes many hops, the probability of loss would seem likely to grow proportionately, causing the reliability of the network to drop linearly with the diameter of the network. There is an alternative approach in which error correction would be done hop by hop. Although packets could still be lost if an intermediate machine crashes, such an approach would have loss rates that are greatly reduced, at some small but fixed background cost (when we discuss the details of reliable communication protocols, we will see that the overhead need not be very high). Why, then, do most systems favor an approach that seems likely to be much less reliable?
In a classic paper, Jerry Saltzer and others took up this issue in 1984 [SRC84]. This paper compared "end to end" reliability protocols, which operate only between the source and destination of a message, with "hop by hop" reliable protocols. They argued that even if reliability of a routed network is improved by the use of hop-by-hop reliability protocols, it will still not be high enough to completely overcome packet loss. Packets can still be corrupted by noise on the lines, machines can crash, and dynamic routing changes can bounce a packet around until it is discarded. Moreover, they argue, the measured average loss rates for lightly to moderately loaded networks are extremely low. True, routing exposes a packet to repeated threats, but the overall reliability of a routed network will still be very high on the average, with worst case behavior dominated by events like routing table updates and crashes that hop-by-hop error correction would not overcome. From this the authors conclude that since hop-by-hop reliability methods increase complexity and reduce performance, and yet must still be duplicated by end-to-end reliability mechanisms, one might as well use a simpler and faster link-level communication protocol. This is the "end to end argument" and has emerged as one of the defining principles governing modern network design.
Saltzer’s paper revolves around a specific example, involving a file transfer protocol. The paper makes the point that the analysis used is in many ways tied to the example and the actual reliability properties of the communication lines in question. Moreover, Saltzer’s interest was specifically in reliability of the packet transport mechanism: failure rates and ordering. These points are important because many authors have come to cite the end-to-end argument in a much more expansive way, claiming that it is an absolute argument against putting any form of "property" or "guarantee" within the communication subsystem. Later, we will be discussing protocols that need to place properties and guarantees into subsystems, as a way of providing system-wide properties that would not otherwise be achievable. Thus, those who accept the "generalized" end-to-end argument would tend to oppose the use of these sorts of protocols on philisophical (one is tended to say "religious") grounds.
A more mature view is that the end-to-end argument is one of those situations where one should accept its point with a degree of skepticism. On the one hand, the end-to-end argument is clearly correct in situations where an analysis comparable to Saltzer’s original one is possible. However, the end-to-end argument cannot be applied blindly: there are situations in which low level properties are beneficial and genuinely reduce complexity and cost in application software, and for these situations, an end-to-end approach might be inappropriate, leading to more complex applications that are error prone or, in a practical sense, impossible to construct.
For example, in a network with high link-level loss rates, or one that is at serious risk of running out of memory unless flow control is used link-to-link, an end-to-end approach may result in near-total packet loss, while a scheme that corrects packet loss and does flow control at the link level could yield acceptable performance. Thus, then, is a case in which Saltzer’s analysis could be applied as he originally formulated it, but would lead to a different conclusion. When we look at the reliability protocols presented in the third part of this textbook, we will see that certain forms of consistent distributed behavior (such as is needed in a fault-tolerant coherent caching scheme) depend upon system-wide agreement that must be standardized and integrated with low-level failure reporting mechanisms. Omitting such a mechanism from the transport layer merely forces the application programmer to build it as part of the application; if the programming environment is intended to be general and extensible, this may mean that one makes the mechanism part of the environment or gives up on it entirely. Thus, when we look at distributed programming environments like the CORBA architecture, seen in Chapter 6, there is in fact a basic design choice to be made: either such a function is made part of the architecture, or by omitting it, no application can achieve this type of consistency in a general and interoperable way except with respect to other applications implemented by the same development team. These examples illustrate that, like many engineering arguments, the end-to-end approach is highly appropriate in certain situations, but not uniformly so.
We have reviewed most stages of the communication architecture that interconnects a sending application to a receiving application. But what of the operating system software at the two ends?
The communications software of a typical operating system is modular, organized as a set of components that subdivide the tasks associated with implementing the protocol stack or stacks in use by application programs. One of these components is the buffering subsystem, which maintains a collection of kernel memory buffers that can be used to temporarily store incoming or outgoing messages. On most UNIX systems, these are called mbufs, and the total number available is a configuration parameter that should be set when the system is built. Other operating systems allocate buffers dynamically, competing with the disk I/O subsystem and other I/O subsystems for kernel memory. All operating systems share a key property, however: the amount of buffering space available is limited.
The TCP and UDP protocols are implemented as software modules that include interfaces up to the user, and down to the IP software layer. In a typical UNIX implementation, these protocols allocate some amount of kernel memory space for each open communication "socket", at the time the socket is created. TCP, for example, allocates an 8-kbyte buffer, and UDP allocates two 8k-byte buffers, one for transmission and one for reception (both can often be increased to 64kbytes). The message to be transmitted is copied into this buffer (in the case of TCP, this is done in chunks if necessary). Fragments are then generated by allocating successive memory chunks for use by IP, copying the data to be sent into them, prepending an IP header, and then passing them to the IP sending routine. Some operating systems avoid one or more of these copying steps, but this can increase code complexity, and copying is sufficiently fast that many operating systems simply copy the data for each message multiple times. Finally, IP identifies the network interface to use by searching the routing table and queues the fragments for transmission. As might be expected, incoming packets trace the reverse path.
An operating system can drop packets or messages for reasons unrelated to the hardware corruption or duplication. In particular, an application that tries to send data as rapidly as possible, or a machine that is presented with a high rate of incoming data packets, can exceed the amount of kernel memory that can safely be allocated to any single application. Should this happen, it is common for packets to be discarded until memory usage drops back below threshold. This can result in unexpected patterns of message loss.
For example, consider an application program that simply tests packet loss rates. One might expect that as the rate of transmission is gradually increased, from one packet per second to 10, then 100, then 1000 the overall probability that a packet loss will occur would remain fairly constant, hence packet loss will rise in direct proportion to the actual number of packets sent. Experiments that test this case, running over UDP, reveal quite a different pattern, illustrated in Figure 3-2; the left graph is for a sender and receiver on the same machine (the messages are never actually put on the wire in this case), and the right the case of a sender and receiver on identical machines connected by an ethernet.
As can be seen from the figure, the packet loss rate, as a percentage, is initially low and constant: zero for the local case, and small for the remote case. As the transmission rate rises, however, both rates rise. Presumably, this reflects the increased probability of memory threshold effects in the operating system. However, as the rate rises still further, behavior breaks down completely! For high rates of communication, one sees bursty behavior in which some groups of packets are delivered, and others are completely lost. Moreover, the aggregate throughput can be quite low in these overloaded cases, and the operating system often reports no errors at all the sender and destination – on the sending side, the loss occurs after UDP has accepted a packet, when it is unable to obtain memory for the IP fragments. On the receiving side, the loss occurs when UDP packets turn out to be missing fragments, or when the queue of incoming messages exceeds the limited capacity of the UDP input buffer.
The quantized scheduling algorithms used in multitasking operating systems like UNIX probably accounts for the bursty aspect of the loss behavior. UNIX tends to schedule processes for long periods, permitting the sender to send many packets during congestion periods, without allowing the receiver to run to clear its input queue in the local case, or giving the interface time to transmitted an accumulated backlog in the remote case. The effect is that once a loss starts to occur, many packets can be lost before the system recovers. Interestingly, packets can also be delivered out of order when tests of this sort are done, presumably reflecting some sort of stacking mechanisms deep within the operating system. Thus, the same measurements might yield different results on other versions of UNIX or other operating systems. However, with the exception of special purpose communication-oriented operating systems such as QNX (a real-time system for embedded applications), one would expect a "similar" result for most of the common platforms used in distributed settings today.
TCP behavior is much more reasonable for the same tests, but there are other types of tests for which TCP can behave poorly. For example, if one processes makes a great number of TCP connections to other processes, and then tries to transmit multicast messages on the resulting 1-many connections, the measured throughput drops worse than linearly, as a function of the number of connections, for most operating systems. Moreover, if groups of processes are created and TCP connections are opened between them, pairwise, performance is often found to be extremely variable – latency and throughput figures can vary wildly even for simple patterns of communications.
UDP or IP multicast gives the same behavior as UDP. However, the user ofmulticast should also keep in mind that many sources of packet loss can result in different patterns of reliability for different recievers. Thus, one destination of a multicast transmission may experience high loss rates even if many other destinations receive all messages with no losses at all. Problems such as this are potentially difficult to detect and are very hard to deal with in software.
Although widely available, TCP, UDP and IP are also limited in the functionality they provide and their flexibility. This has motivated researchers to investigate new and more flexible protocol development architectures that can co-exist with TCP/IP but support varying qualities of transport service that can be matched closely to the special needs of demanding applications.
Prominent among such efforts is the Xpress Transfer Protocol (XTP), which is a toolkit of mechanisms that can be exploited by users to customize data transfer protocols operating in a point to point or multicast environment. All aspects of the the protocol are under control of the developer, who sets option bits during individual packet exchanges to support a highly customizable suite of possible comunication styles. References to this work include [SDW92,XTP95,DFW90].
XTP is a connection oriented protocol, but one in which the connection setup and closing protocols can be varied depending on the needs of the application. A connection is identified by a 64-bit key; 64-bit sequence numbers are used to identify bytes in transit. XTP does not define any addressing scheme of its own, but is normally combined with IP addressing. An XTP protocol is defined as an exchange of XTP messages. Using the XTP toolkit, a variety of options can be specified for each message transmitted; the effect is to support a range of possible "qualities of service" for each communication session. For example, an XTP protocol can be made to emulate UDP or TCP-style streams, to operate in an unreliable source to destination mode, with selective retransmission based on negative acknowledgements, or can even be asked to "go back" to a previous point in a transmission and to resume. Both rate-based and windowing flow control mechanisms are available for each connection, although one or both can be disabled if desired. The window size is configured by the user at the start of a connection, but can later be varied while the connection is in use, and a set of traffic parameters can be used to specify requirements such as the maximum size of data segments that can be transmitted in each packet, maximum or desired burst data rates, and so forth. Such parameters permit the development of general purpose transfer protocols that can be configured at runtime to match the properties of the hardware environment.
This flexibility is exploited in developing specialized transport protocols that may look like highly optimized version of the standard ones, but that can also provide very unusual properties. For example, one could develop a TCP-style of stream that will reliable provided that the packets sent arrive "on time", using a user-specific notion of time, but that drops packets if they timeout. Similarly, one can develop protocols with out-of-band or other forms of priority-based services.
At the time of this writing, XTP was gaining significant support from industry leaders whose future product lines potentially require flexibility from the network. Video servers, for example, are poorly matched to the communication properties of TCP connections, hence companies that are investing heavily in "video on demand" face the potential problem of having products that work well in the laboratory but not in the field, because the protocol architecture connecting customer applications to the server is inappropriate. Such companies are interested in developing proprietary data transport protocols that would essentially extend their server products into the network itself, permitting fine-grained control over the communication properties of the environment in which their servers operate, and overcoming limitations of the more traditional but less flexible transport protocols.
In Chapters 13 through 16 we will be studying special purpose protocols designed for settings in which reliability requires data replication or specialized performance guarantees. Although we will generally present these protocols in the context of streams, UDP, or IP-multicast, it is likely that the future will bring a considerably wider set of transport options that can be exploited in applications with these sorts of requirements.
There is, however, a downside associated with the use of customized protocols based on technologies such as XTP: they can create difficult management and monitoring problems, which will often go well beyond those seen in standard environments where tools can be developed to monitor a network and to display, in a well organized manner, the status of the network and applications. Such tools benefit from being able to intercept network traffic and to associate the message sent with the applications sending them. To the degree that technologies such as XTP lead to extremely specialized patterns of communication that work well for individual applications, they may also reduce this desirable form of regularity and hence impose obstacles to system control and management.
Broadly, one finds a tension within the networking community today. On the one side are developers convinced that their special-purpose protocols are necessary if a diversity of communications products and technologies are to be feasible over networks such as the Internet. In some sense this community generalizes to also include the community that develops special purpose reliability protocols and that may need to place "properties" within the network to support those protocols. On the other stand the system administrators and managers, whose lives are already difficult, and who are extremely resistant to technologies that might make this problem worse. Sympathizing with them are the performance experts of the operating systems communications community: this group favors an end-to-end approach because it greatly simplifies their task, and hence tends to oppose technologies such as XTP because they result in irregular behaviors that are harder to optimize in the general case. For these researchers, knowing more about the low level requirements (and keeping them as simple as possible) makes it more practical to optimize the corresponding code paths for extremely high performance and low latency.
From a reliability perspective, one must sympathize with both points of view: this textbook will look at problems for which reliability requires high performance or other guarantees, and problems for which reliability implies the need to monitor, control, or manage a complex environment. If there is a single factor that prevents a protocol suite such as XTP from "sweeping the industry", it seems likely to be this. More likely, however, is an increasingly diverse collection of low-level protocols, creating ongoing challenges for the community that must administer and monitor the networks in which those protocols are used.
There is a sense in which it is not surprising that problems such as the performance anomalies cited in the previous sections would be common on modern operating systems, because the communication subsystems have rarely been designed or tuned to guarantee good performance for communication patterns such as were used to produce Figure 3-2. As will be seen in the next few chapters, the most common communication patterns are very regular ones that would not trigger the sorts of pathological behaviors caused by memory resource limits and stressful communication loads.
However, given a situation in which most systems must in fact operate over protocols such as TCP and UDP, these behaviors do create a context that should concern students of distributed systems reliability. They suggest that even systems that behave well most of the time may break down catastrophically because of something as simple as a slight increase in load. Software designed on the assumption that message loss rates are low may, for reasons completely beyond the control of the developer, encounter loss rates that are extremely high. All of this can lead the researcher to question the appropriateness of modern operating systems for reliable distributed applications. Alternative operating systems architectures that offer more controlled degradation in the presence of excess load represent a potentially important direction for investigation and discussion.
On the Internet protocols: [Tan88, Com91, CS91, CS93, CDK94]. Performance issues for TCP and UDP: [Com91, CS91, CS93, ALFxx, KP93, PP93, BMP94, Hun95]. IP Multicast: [FWB85, Dee88, Dee89, DC90, Hun95]. Active Messages: [ECGS92, EBBV95]. End-to-end argument: [SRC84]. Xpress Transfer Protocol: [SDW92, XTP95, DFW90].
The emergence of "real" distributed computing systems is often identified with the client-server paradigm, and a protocol called remote procedure call which is normally used in support of this paradigm. The basic idea of a client-server system architecture involves a partitioning of the software in an application into a set of services, which provide a set of operations to their users, and client programs, which implement applications and issue requests to services as needed to carry out the purposes of the application. In this model, the application processes do not cooperate directly with one another, but instead share data and coordinate actions by interacting with a common set of servers, and by the order in which the application programs are executed.
There are a great number of client-server system structures in a typical distributed computing environment. Some examples of servers include the following:
In most distributed systems, services can be instantiated multiple times. For example, a distributed system can contain multiple file servers, or multiple name servers. We normally use the term service to denote a set of servers. Thus, the network file system service consists of the network file servers for a system, and the network information service is a set of servers, provided on UNIX systems, that map symbolic names to ascii strings encoding "values" or addresses. An important question to ask about a distributed system concerns the binding of applications to servers.
We say that a binding occurs when a process that needs to talk to a distributed service becomes associated with a specific server that will perform requests on its behalf. Various binding policies exist, differing in how the server is selected. For an NFS distributed file system, binding is a function of the file pathname being accessed – in this file system protocol, the servers all handle different files, so that the pathname maps to a particular server that owns that file. A program using the UNIX network information server normally starts by looking for a server on its own machine. If none is found, the program broadcasts a request and binds to the first NIS that responds, the idea being that this NIS representative is probably the least loaded and will give the best response times. (On the negative side, this approach can reduce reliability: not only will a program now be dependent on availability of its file servers, but it may be dependent on an additional process on some other machine, namely the NIS server to which it became bound). The CICS database system is well known for its explicit load-balancing policies, which bind a client program to a server in a way that attempts to give uniform responsiveness to all clients.
Algorithms for binding, and for dynamically rebinding, represent an important topic to which we will return in Chapter 17, once we have the tools at our disposal to solve the problem in a concise way.
A distributed service may or may not employ data replication, whereby a service maintain more than one copy of a single data item to permit local access at multiple locations, or to increase availability during periods when some server processes may have crashed. For example, most network file services can support multiple file servers, but do not replicate any single file onto multiple servers. In this approach, each file server handles a partition of the overall file system, and the partitions are disjoint from one another. A file can be replicated, but only by giving each replica a different name, placing each replica on an appropriate file server, and implementing hand-crafted protocols for keeping the replicas coordinated. Replication, then, is an important issue in designing complex or highly available distributed servers.
Caching is a closely related issue. We say that a process has cached a data item if it maintains a copy of that data item locally, for quick access if the item is required again. Caching is widely used in file systems and name services, and permits these types of systems to benefit from locality of reference. A cache hit is said to occur when a request can be satisfied out of cache, avoiding the expenditure of resources needed to satisfy the request from the primary store or primary service. The Web uses document caching heavily, as a way to speed up access to frequently used documents.
Caching is similar to replication, except that cached copies of a data item are in some ways second-class citizens. Generally, caching mechanisms recognize the possibility that the cache contents may be stale, and include a policy for validating a cached data item before using it. Many caching schemes go further, and include explicit mechanisms by which the primary store or service can invalidate cached data items that are being updated, or refresh them explicitly. In situations where a cache is actively refreshed, caching may be identical to replication – a special term for a particular style of replication.
However, "generally" does not imply that this is always the case. The Web, for example, has a cache validation mechanism but does not actually require that web proxies validate cached documents before providing them to the client; the reasoning is presumably that even if the document were validated at the time of access, nothing prevents it from changing immediately afterwards and hence being stale by the time the client display it, in any case. Thus a periodic refreshing scheme in which cached documents are refreshed every half hour or so is in many ways equally reasonable. A caching policy is said to be coherent if it guarantees that cached data is indistinguish to the user from the primary copy. The web caching scheme is thus one that does not guarantee coherency of cached documents.
The most common communication protocol for communication between the clients of a service and the service itself is remote procedure call. The basic idea of an RPC originated in work by Nelson in the early 1980’s [BN84]. Nelson worked in a group at Xerox Parc that was developing programming languages and environments to simplify distributed computing. At that time, software for supporting file transfer, remote login, electronic mail, and electronic bulletin boards had become common. Parc researchers, however, and ambitious ideas for developing other sorts of distributed computing applications, with the consequence that many researchers found themselves working with the lowest level message passing primitives in the Parc distributed operating system, which was called Cedar.
Much like a more modern operating system, message communication in Cedar supported three communication models:
Programmers found these interfaces hard to work with. Any time a program p needed to communicate with a program s, it was necessary for p to determine the network address of s, encode its requests in a way that s would understand, send off the request, and await a reply. Programmers soon discovered that certain basic operations needed to be performed in almost any network application, and that each developer was developing his or her own solutions to these standard problems. For example, some programs used broadcasts to find a service with which they needed to communicate, others stored the network address of services in files or hard-coded them into the application, and still others supported directory programs with which services could register themselves, and supporting queries from other programs at runtime. Not only was this situation confusing, it turned out to be hard to maintain the early versions of Parc software: a small change to a service might "break" all sorts of applications that used it, so that it became hard to introduce new versions of services and applications.
Surveying this situation, Bruce Nelson started by asking what sorts of interactions programs really needed in distributed settings. He concluded that the problem was really no different from function or procedure call in a non-distributed program that uses a presupplied library. That is, most distributed computing applications would prefer to treat other programs with which they interact much as they treat presupplied libraries, with well known, documented, procedural interfaces. Talking to another program would then be as simple as invoking one of its procedures – a remote procedure call (RPC for short).
The idea of remote procedure call is compelling. If distributed computing can be transparently mapped to a non-distributed computing model, all the technology of non-distributed programming could be brought to bear on the problem. In some sense, we would already know how to design and reason about distributed programs, how to show them to be correct, how to test and maintain and upgrade them, and all sorts of preexisting software tools and utilities would be readily applicable to the problem.
Unfortunately, the details of supporting remote procedure call turn out to be non-trivial, and some aspects result in "visible" differences between remote and local procedure invocations. Although this wasn’t evident in the 1980’s when RPC really took hold, the subsequent ten or fifteen years saw considerable theoretical activity in distributed computing, out of which ultimately emerged a deep understanding of how certain limitations on distributed computing are reflected in the semantics, or properties, of a remote procedure call. In some ways, this theoretical work finally lead to a major breakthrough in the late 1980’s and early 1990’s, when researchers learned how to create distributed computing systems in which the semantics of RPC are precisely the same as for local procedure call (LPC). In Part III of this text, we will study the results and necessary technology underlying such a solution, and will see how to apply it to RPC. We will also see, however, that such approaches involve subtle tradeoffs between semantics of the RPC and performance that can be achieved, and that the faster solutions also weaken semantics in fundamental ways. Such considerations ultimately lead to the insight that RPC cannot be transparent, however much we might wish that this was not the case.
Making matters worse, during the same period of time a huge engineering push behind RPC elevated it to the status of a standard – and this occurred before it was understand how RPC could be made to accurately mimic LPC. The result of this is that the standards for building RPC-based computing environments (and to a large extent, the standards for object-based computing that followed RPC in the early 1990’s) embody a non-transparent and unreliable RPC model, and that this design decision is often fundamental to the architecture in ways that the developers who formulated these architectures probably did not appreciate. In the next chapter, when we study stream-based communication, we will see that the same sort of premature standardization affected the standard streams technology, which as a result also suffer from serious limitations that could have been avoided had the problem simply been better understood at the time the standards were developed.
In the remainder of this chapter, we will focus on standard implementations of RPC. We will look at the basic steps by which an program RPC is coded in a program, how that program is translated at compile time, and how it becomes bound to a service when it is executed. Then, we will study the encoding of data into messages and the protocols used for service invocation and to collect replies. Finally, we will try to pin down a semantics for RPC: a set of statements that can be made about the guarantees of this protocol, and that can be compared with the guarantees of LPC.
We do not, however, give detailed examples of the major RPC programming environments: DCE and ONC. These technologies, which emerged in the mid 1980’s, represented proposals to standardize distributed computing by introducing architectures within which the major components of a dtsributed computing system would have well-specified interfaces and behaviors, and within which application programs could interoperate using RPC by virtue of employing standard RPC interfaces. DCE, in particular, has become relatively standard, and is available on many platforms today [DCE94]. However, in the mid-1990’s, a new generation of RPC-oriented technology emerged through the Object Management Group, which set out to standardize object-oriented computing. In a short period of time, the CORBA [OMG91] technologies defined by OMG swept past the RPC technologies, and for a text such as the present one, it now makes more sense to focus on CORBA, which we discuss in Chapter 6. CORBA has not so much changed the basic issues, as it has broadened the subject of discourse by covering more kinds of system services than did previous RPC systems. Moreover, many CORBA systems are implemented as a layer over DCE or ONC. Thus, although RPC environments are important, they are more and more hidden from typical programmers and hence there is limited value in seeing examples of how one would program applications using them directly.
Many industry analysis talk about CORBA implemented over DCE, meaning that they like the service definitions and object orientation of CORBA, and that it makes sense to assume that these are build using the service implementations standardized in DCE. In practice, however, CORBA makes as much sense on a DCE platform as on a non-DCE platform, hence it would be an exaggeration to claim that CORBA on DCE is a de-facto standard today, as one sometimes reads in the popular press.
The use of RPC leads to interesting problems of reliability and fault-handling. As we will see, it is not hard to make RPC work if most of the system is working well. When a system malfunctions, however, RPC can fail in ways that leave the user with no information at all about what has occurred, and with no apparent strategy for recovering from the situation.There is nothing new about the situations we will be studying – indeed, for many years, it was simply assumed that RPC was subject to intrinsic limitations, and that there being no obvious way to improve on the situation, there was no reason that RPC shouldn’t reflect these limitations in its semantic model. As we advance through the book, however, and it becomes clear that there are realistic alternatives that might be considered, this point of view becomes increasingly open to question.
Indeed, it may now be time to develop a new set of standards for distributed computing. The existing standards are flawed, and the failure of the standards community to repair these flaws has erected an enormous barrier to the development of reliable distributed computing systems. In a technical sense, these flaws are not tremendously hard to overcome – although the solutions would require some reengineering of communication support for RPC in modern operating systems. In a practical sense, however, one wonders if it will take a "Tacoma Narrows" event to create real industry interest in taking such steps.
One could build an RPC environment that would have few, if any, user-visible incompatibilities from a more fundamentally rigorous approach. The issue then is one of education – the communities that control the standards need to understand the issue better, and need to understand the reasons that this particular issue represents such a huge barrier to progress in distributed computing. And, the community needs to recognize that the opportunity vastly outweighs the reengineering costs that would be required to seize it. With this goal in mind, let’s take a close look at RPC.
The programmer of an RPC-based application employs what is called a stub generation tool. Such a tool is somewhat like a macro preprocessor: it transforms the user’s original program into a modified version, which can be linked to an RPC runtime library.
From the point of view of the programmer, the server or client program looks much like any other program. Normally, the program will import or export a set of interface definitions, covering the remote procedures that will be obtained from remote servers or offered to remote clients, respectively. A server program will also have a "name" and a "version", which are used to connect the client to the server. Once coded, the program is compiled in two stages: first the stub generator is used to map the original program into a standard program with added code to carry out the RPC, and then the standard program is linked to the RPC runtime library for execution.
RPC-based application or server programs are coded in a programming style very similar to a non-distributed program written in C for UNIX: there is no explicit use of message passing. However, there is an important aspect of RPC programming that differs from programming with local procedure calls: the separation of the service interface definition, or IDL, from the code that implements it. In an RPC application, a service is considered to have two parts. The interface definition specifies the way that the service can be located (its name), the data types used in issuing requests to it, and the procedure calls that it supports. A version number is included to provide for evolution of the service over time – the idea being that if a client is developed to use version 1.1 of a service, there should be a way to check for compatibility if it turns out that version 1.0 or 2.3 is running when the client actually gets executed.
The basic actions of the RPC library were described earlier. In the case of a server program, the library is responsible for registering the program with the RPC directory service program, which is normally provided as part of the RPC runtime environment. An RPC client program will automatically perform the tasks needed to connect query the directory to find this server and to connect to it, creating a client-server binding. For each of the server operations it invokes, code will be executed to marshall a representation of the invocation into a message – that is, information about the way that the procedure was called and values of the parameters that were passed. Code is included to send this message to the service, and to collect a reply; on the server side, the stub generator creates code to read in such a message, invoke the appropriate procedure with the arguments used by the remote caller, and to marshall the results for transmission back to the caller. Issues such as user-id handling, security and privacy, and handling of exceptions are often packaged as part of a solution. Finally, back on the caller side, the returning message will be demarshalled and the result made to look like the result of a local procedure.
Although much of this mechanism is automatic and hidden from the programmer, RPC programming differs from LPC programming in many ways. Most noticeable is that most RPC packages limit the types of arguments that can be passed to a remote server, and some also limit the size (in bytes) of the argument information. For example, suppose that a local procedure is written to search a list, and an LPC is performed to invoke this procedure, passing a pointer to the head of the list as its argument. One can ask whether this should work in an RPC environment – and if so, how it can be supported. If a pointer to the head of the list is actually delivered to a remote program, that pointer will not make sense in the remote address space where the operation will execute. So, it would be natural to propose that the pointer be dereferenced, by copying the head of the list into the message. Remotely, a pointer to the copy can be provided to the procedure. Clearly, however, this will only work if one chases all the pointers in question – a problem because many programs that use pointers have some representation for an uninitialized pointer, and the RPC stub generator may not know about this.
For example, in building a balanced tree, it is common to allocate nodes dynamically as items are inserted. A node that has no descendents would still have left and right pointer fields, but these would be initialized to nil and the procedure to search nodes would check for the nil case before dereferencing these pointers. Were an RPC marshalling procedure to automatically make a copy of a structure to send to the remote server, it would need to realize that for this particular structure, a pointer value of nil has a special meaning and should not be "chased".
The RPC programmer sees issues such as these as a set of restrictions. Depending on the RPC package used, different approaches may be used to attack them. In many packages, pointers are simply not legal as arguments to remote procedures. In others, the user can control a copying mechanism to some degree, and in still fancier systems, the user must provide general purpose structure traversal procedures that will be used by the RPC package to marshall arguments. Further complications can arise if a remote procedure may modify some of its arguments. Again, the degree to which this is supported at all, and the degree to which the programmer must get involved, vary from package to package.
Perhaps ironically, RPC programmers tend to complain about this aspect of RPC no matter how it is handled. If a system is highly restrictive, the programmer finds that remote procedure invocation is annoying because one is constantly forced to work around the limitations of the invocation package. For example, if an RPC package imposes a size limit on the arguments to a procedure, an application that works perfectly well in most situations may suddenly fail because some dynamically defined object has grown too large to be accepted as an RPC parameter. Suddenly, what was a single RPC becomes a multi-RPC protocol for passing the large object in chunks, and a perfectly satisfied programmer has developed distinct second thoughts about the transparency of RPC. At the other extreme are programming languages and RPC packages in which RPC is extremely transparent. These, however, often incur high overheads to copy information in and out, and the programmer is likely to be very aware of these because of their cost implications. For example, a loop that repeatedly invokes a procedure with one parameter changing and others (including a pointer to some large object) may be quite inexpensive to invoke in the local case. But if the large object will be copied to a remote program on every invocation, the same loop may cost a fortune when coded as part of a distributed client-server application, forcing the program to be redesigned to somehow pass the object to the remote server prior to the computational loop. These sorts of issues, then, make programming with RPC quite different from programming with LPC.
RPC also introduces error cases that are not seen in LPC, and the programmer needs to deal with these. An LPC would never fail with a "binding error", or a "version mismatch" or a "timeout." In the case of RPC, all of these are possibilities – a binding error would arise if the server is not running when the client is started. A version mismatch might occur if a client was compiled against version 1 of a server, but the server has now been upgraded to version 2. A timeout could result from a server crash, or a network problem, or even a problem on the client’s computer. Many RPC applications would view these sorts of problems as unrecoverable errors, but fault-tolerant systems will often have alternative sources for critical services and will need to fail-over from a primary server to a backup. The code to do this is potentially complex, and in most RPC environments, must be implemented by the application developer on a case-by-case basis.
The binding problem arises when an RPC client program needs to determine the network address of a server capable of providing some service it requires. Binding can be approached from many perspectives, but the issue is simplified if issues associated with the name service used are treated separately, as we do here.
Disregarding its interactions with the name service, a binding service is primarily a protocol by which the RPC system verifies compatibility between the client and server and establishes any connections needed for communication.
The compatibility problem is important in systems that will operate over long periods of time, during which maintenance and the development of new versions of system components will inevitably occur. Suppose that a client program c was developed and tested using server s, but that we now wish to install a new version of s, c, or both. Upgrades such as these create a substantial risk that some old copy of c will find itself talking to a new copy of s, or vice versa. For example, in a network of workstations it may be necessary to reload c onto the workstations one by one, and if some machines are down when the reload occurs, an old copy of c could remain on its disk. Unless c is upgraded as soon as the machine is rebooted – and this may or may not occur, depending on how the system is administered – one would find an old c talking to an upgraded s. It is easy to identify other situations in which problems such as this could arise.
It would be desirable to be able to assume that all possible "versions" of s and c could somehow communicate with all other versions, but this is not often the case. Indeed, it is not necessarily even desirable. Accordingly, most RPC environments support a concept of version number which is associated with the server IDL. When a client program is compiled, the server IDL version is noted in software. This permits the inclusion of the client’s version of the server interface directly in the call to the server. When the match is not exact, the server could reject the request as being incompatible, perform some operation to map the old-format request to a new-format request, or even preserve multiple copies of its functionality, running the version matched to the caller.
Connection establishment is a relatively mechanical stage of binding. Depending on the type of client-server communication protocol that will be used, messages may be transmitted using unreliable datagrams or over reliable communication streams such as X.25 or TCP. Unreliable datagram connections normally do not require any initial setup, but stream connections typically involve some form of open or initialization operation. Having identified the server to which a request will be issued, the binding mechanism would normally perform this open operation.
The binding mechanism is sometimes used to solve two additional problems. The first of these is called the "factory" problem, and involves starting a server when a service has no currently operational server. In this approach, the first phase of binding looks up the address of the server and learns that the server is not currently operational (or, in the connection phase, a connection error is detected and from this the binder deduces that the server has failed). The binder then issues a request to a factory in which the system designer has stored instructions for starting a server up when needed. After a suitable pause, the binder cycles back through its first phase, which presumably succeeds.
The second additional problem arises in the converse situation, when the binder discovers multiple servers that could potentially handle this client. The best policy to use in such situations depends very much on the application. For some systems, a binder should always pick a server on the same machine as the client, if possible, and should otherwise pick randomly. Other systems require some form of load-balancing, while still others may implement an affinity policy under which a certain server might be especially well suited to handling a particular client for reasons such as the data it has cached in memory, or the type of requests the client is expected to issue once binding has been completed.
Binding is a relatively expensive operation. For example, in the DCE RPC environment, binding can be more than 10 times as costly as RPC. However, since binding only occurs once for each client-server pair, this high cost is not viewed as a major problem in typical distributed computing systems.
The purpose of a data marshalling mechanism is to represent the caller’s arguments in a way that can be efficiently interpreted by a server program. In the most general cases, this mechanism deals with the possibility that the computer on which the client is running uses a different data representation than the computer on which the server is running.
Mashalling has been treated at varying levels of generality, and in fact there exists a standard, ASN.1, for self-describing data objects in which a specific representation is recommended. In addition to ASN.1, several major vendors have adopted data representations of their own, such as SUN Microsystem’s External Data Representation (XDR) format, which is used in the widely popular Network File System (NFS) protocol.
The basic issues that arise in a data marshalling mechanism, then, are these. First, integer representations vary for the most common CPU chips. On some chips the most significant byte of an integer is also the low byte of the first word in memory, while on others the most significant byte is stored in the high byte of the last word of the integer. These are called little-endian and big-endian representations. At one point in the 1980’s, computers with other representations – other byte permutations – were on the market, but at the time of this writing the author is not aware of any other surviving formats.
A second representation issue concerns data alignment. Some computers require that data be aligned on 32-bit or even 64-bit boundaries, while others may have weaker alignment rules, for example by supporting data alignment on 16-bit boundaries. Unfortunately, such issues are extremely common. Compilers know about these rules, hence the programmer is typically unaware of them. However, when a message arrives from a remote machine that may be using some other alignment rule, the issues becomes an important one. An attempt to fetch data directly from a message without attention to this issue could result in some form of machine fault, or could result in retrieval of garbage. Thus, the data representation used in messages must encode sufficient information to permit the destination computer to find the start of object in the message, or the sender and destination must agree in advance on a packed representation that will be used for messages "on the wire" even if the sender and destination themselves share the same rules and differ from the standard. Needless to say, this is a topic capable of generating endless and fascinating debate among computer vendors whose machines use different alignment or data representations.
A third issue arises from the existence of multiple floating point representations. Although there is an IEEE standard floating point representation, which has become widely accepted, some computer vendors use non-standard representations for which conversion would be required, and even within computers using the standard, byte ordering issues can still arise.
A forth issue concerns pointers. When transmitting a complex structure in which there are pointers, the marshalling mechanism needs to either signal that the user has requested something illegal, or to somehow represent these pointers in a way that will permit the receiving computer to "fix" them upon reception of the request. This is especially tricky in languages like LISP, which require pointers and hence cannot easily legislate against them in RPC situations. On the other hand, passing pointers raises additional problems: should the pointed-to object be included in the message, transferred only upon use (a "lazy" scheme), or handled in some other way?
Finally, a marshalling mechanism may need to deal with incompatibilities in the basic data types available on computers. For example, a pair of computers supporting 64-bit integers in hardware may need to exchange messages containing 64-bit integer data. The marshalling scheme should therefore be able to represent such integers. On the other hand, when this type of message is sent to a computer that uses 32-bit integers the need arises to truncate the 64-bit quantities so that they will fit in the space available, with an exception being generated if data would be lost by such a truncation. Yet, if the message is merely being passed through some sort of intermediary, one would prefer that data not be truncated, since precision would be lost. In the reverse direction, sign extension or padding may need to be performed to convert a 32-bit quantity into an equivalent 64-bit quantity, but only if the data sent is a signed integer. Thus, a completely general RPC package needs to put a considerable amount of information into each packet, and may need to do quite a bit of work to represent data in a universal manner. On the other hand, such an approach may be much more costly than one that supports only a very limited set of possible representations, or that compiles the data marshalling and demarshalling operations directly into inline code.
The approach taken to marshalling varies from RPC package to package. SUN’s XDR system is extremely general, but requires the user to code marshalling procedures for data types other than the standard base types of the system. With XDR, one can represent any desired data structure, even dealing with pointers and complex padding rules. At the other end of the spectrum are marshalling procedures that transmit data in the binary format used by the sender, are limited to only simple data types, and perhaps do little more than compatibility checking on the receive side. Finally, schemes like ISDN.1 are often used with RPC stub generators, which automatically marshall and demarshall data, but impose some restrictions on the types of objects that can be transmitted.
As a general rule of thumb, users will want to be aware that the more general solutions to these problems are also more costly. If the goal is extremely speed, it may make sense to design the application itself to produce data in a form that is inexpensive to marshall and demarshall. The cost implications of failing to do so can be surprising, and in many cases, it is not even difficult to redesign an interface so that RPC to it will be cheap.
No RPC system lives in isolation. As we will see below, RPC is often integrated with a security mechanism, and because security keys (and some parts of the RPC protocol itself) use timestamps, with a clock synchronization mechanism. For this reason, one often talks about distributed computing "environments" that include tools for implementing client-server applications including an RPC mechanism, security services and time services. Elaborate environments may go well beyond this, including system instrumentation and management interfaces and tools, fault-tolerance tools, and so-called Forth Generation Language (4GL) tools for building applications using graphical user interfaces (GUI’s). Such approaches can empower even unskilled users to develop sophisticated distributed solutions. In this section we briefly review the most important of these services.
A naming service maintains one or more mappings from some form of name (normally symbolic) to some form of value (normally, a network address). Naming services can operate in a very narrow, focused way – for example, the Domain Naming Service of the TCP/IP protocol suite maps short service names, in ascii, to IP addresses and port numbers, requiring exact matches. At the other extreme, one can talk about extremely general naming services that are used for many sorts of data, allow complex pattern matching on the name, and may return other types of data in addition to, or instead of, an address. One can even go beyond this, to talk about secure naming services that can be trusted to only give out validated addresses for services, very dynamic naming services that deal with applications like mobile computing systems in which hosts have addresses that change constantly, and so forth.
In standard computer systems at the time of this writing, three naming services are widely supported and used. Mentioned above, the Domain Name Service (DNS) is the least functional but most widely used. It responds to requests on a standard network port address, and for the "domain" in which it is running can map short (8 character) strings to internet port numbers. DNS is normally used for static services, which are always running when the system is operational and do not change port numbers at all. For example, the email protocol uses DNS to find the remote mail daemon capable of accepting incoming email to a user on a remote system.
The Network Information Service (NIS), previously called Yellow Pages (YP), is considerably more elaborate. NIS maintains a collection of maps, each of which has a symbolic name (e.g. "hosts", "services", etc.) and maps ascii keywords to an ascii value string. NIS is used on UNIX systems to map host names to internet addresses, service names to port numbers, etc. Although NIS does not support pattern matching, there are ways for an application to fetch the entire NIS database, one line at a time, and it is common to include multiple entries in an NIS database for a single host that is known by a set of aliases. NIS is a distributed service that supports replication: the same data is normally available from any of a set of servers, and a protocol is used to update the full set of servers if an entry changes. However, NIS is not designed to support rapid updates: the assumption is that NIS data consists of mappings like the map from host name to internet address, which change very rarely. A 12-hour delay before NIS information is updated is not unreasonable given this model, hence the update problem is solved by periodically refreshing the state of each NIS server by having it re-read the contents of a set of files in which the mapping data is actually stored. As an example, NIS is often used to store password information on UNIX systems.
X.500 is an international standard that many expect will eventually replace NIS. This service, which is designed for use by applications running the ISO standard remote procedure call interface and ISDN.1 data encoding, operates much like an NIS server. No provision has been made in the standard for replication or high performance update, but the interface does support some limited degree of pattern matching. As might be expected from a standard of this sort, X.500 addresses a wide variety of issues, including security and recommended interfaces. However, reliability issues associated with availability and consistency of the X.500 service (i.e. when data is replicated) have not yet been tackled by the standards organization.
Looking to the future, there is considerable interest in using X.500 to implement general purpose White-Pages (WP) servers, which would be explicitly developed to support sophisticated pattern matching on very elaborate databases with detailed information about abstract entities. Rapid update rates, fault-tolerance features, and security are all being considered in these proposals. At the time of this writing, it appears that the Web will require such services and hence that the work on universal resource naming for use in the Web will be a major driving force for evolution in this overall area.
With the launch of the so-called Global Positioning System satellites, micro-second accuracy become possible in workstations equipped with inexpensive radio receivers. Unfortunately, however, accurate clocks remain a major problem in the most widely used computer workstations and network technologies. We will have a great to say about this in Chapter 20, but some background may still be useful here.
At the time of this writing, the usual clock for a PC or workstation consists of a quartz-based chip much like the one in a common wristwatch, accurate to within a few seconds per year. The initial value of such a clock is either set by the vendor or by the user, when the computer is booted. As a result, in any network of workstations, clock can give widely divergent readings and can drift with respect to one-another at significant rates. For these reasons, there has been considerable study of algorithms for clock synchronization, whereby the clocks on invidual machines can be adjusted to give behavior approximating that of a shared global clock. In Chapter 20, we will discuss some of the algorithms that have been proposed for this purpose, their ability to tolerate failures, and the analyses used to arrive at theoretical limits on clock accuracy.
However, much of this work has a limited lifetime. GPS receivers can give extremely accurate time, and GPS signals are transmitted frequently enough so that even inexpensive hardware can potentially maintain time accurate to microseconds. By broadcasting GPS time values, this information can be propagated within a network of computers, and although some accuracy is necessarily lost when doing so, the resulting clocks are still accurate and comparable to within tens of microseconds. This development can be expected to have a major impact on the way that distributed software is designed – from a world of asynchronous communication and clocks that can be inaccurate by many times the average message latency in the network, GPS based time could catapult us into a domain in which clock resolutions considerably exceed the average latency between sending a message and when it is received. Such developments make it very reasonable to talk about synchronous (time-based) styles of software design and the use of time in algorithms of all sorts.
Even coarsely synchronized clocks can be of value in distributed software. For example, when comparing versions of files, microsecond accuracy is not needed to decide if one version is more current than another: accuracy of seconds or even tens of seconds may be adequate. Security systems often have a notion of expiration associated with keys, but for these to be at risk of "attacks" an intruder would need a way to set a clock back by days, not fractions of a second. And, although we will see that RPC protocols use time to detect and ignore very old, stale, messages, as in the case of a security mechanism a clock would need to be extremely inaccurate for such a system to malfunction.
In the context of an RPC environment, security is usually concerned with the authentication problem. Briefly stated, this is the problem of providing applications with accurate information about the user-id on behalf of which a request is being performed. Obviously, one would hope that the user-id is related in some way to the user, although this is frequently the weak link in a security architecture. Given an accurate source of user identifications, however, the basic idea is to avoid intrusions that can compromise user-id security through break-ins on individual computers and even replacements of system components on some machines with versions that have been compromised and hence could malfunction. As in the case of clock services, we will looking more closely at security later in the textbook (Chapter 19) and hence limit ourselves to a brief review here.
To accomplish authentication, a typical security mechanism (for example, the Kerberos security architecture for DCE [SNS88, Sch94]) will request some form of password or one-time key from the user at login time, and periodically thereafter, as keys expire on the basis of elapsed time. This information is used to compute a form of secure user-identification that can be employed during connection establishment. When a client binds to a server, the security mechanism authenticates both ends, and also (at the option of the programmer) arranges for data to be encrypted on the wire, so that intruders who witness messages being exchanged between the client and server have no way to decode the data contained within them. (Unfortunately, however, this step is so costly that many applications disable encryption and simply rely upon the security available from the initial connection setup). Notice that for such a system to work correctly, there must be a way to "trust" the authentication server itself: the user needs a way to confirm that it is actually talking to the authentication server, and to legitimate representatives of the services it wishes to use. Give the anonymity of network communication, these are potentially hard problems.
In Chapter 19, we will look closely at distributed security issues (for example, we will discuss Kerberos in much more detail), and also at the relationship between security and other aspects of reliability and availability – problems that are often viewed as mutually exclusive since one replicates information to make it more available, but would tend to restrict and protect it to make it more secure. We will also look at emerging techniques for protecting privacy, namely the "true" user-id’s of programs active in a network. Although the state of the art does not yet support construction of high performance, secure, private applications, this should be technically feasible within the not-distant future. Of course, technical feasibility does not imply that the technology will become widely practical and hence useful in building reliable applications, but at least the steps needed to solve the problems are increasingly understood.
Yet a fourth component of a typical RPC system is the lightweight threads package, which enables a single program to handle multiple tasks at the same time. Although threads are a general concept and indeed have rather little to do with communication per-se, they are often viewed as necessary in distributed computing systems because of the potential for deadlock if threads are not present.
To understand this point, it is helpful to contrast three ways of designing a communication system. A single-threaded message-based approach would correspond to a conventional style of programming extended directly to message passing. The programmer would use system calls like sendto and recvfrom as desired to send and receive messages. If there are several things happening at the same time in a program structured this way, however, the associated bookkeeping can be a headache (see Figure 4-3).
Threads offer a simple way to eliminate this problem: each thread executes concurrently with the others, and each incoming request spawns a new thread to handle it. While an RPC is pending the thread that issues it blocks (waits) in the procedure call that invoked the RPC. To the degree that there is any bookkeeping to worry about, the associated state is represented directly in the local variables of this procedure and in the call itself: when the reply is received, the procedure returns (the thread resumes execution), and there is no need to track down information about why the call was being done: this is "obvious" to the calling procedure. Of course, the developer does need to implement adequate synchronization to avoid concurrency-related bugs, but in general this is not a hard thing to do. The approach overcomes many forms of problems that are otherwise hard to address.
For example, consider a situation in which an RPC server is also the client of some other server, which is in turn the client of still additional servers. It is entirely possible that a cycle could form, in which RPC a by process x on process y leads to an RPC b by y on z, and so forth, until finally some process in the chain makes a request back to the original process, x. If these calls were LPC calls, such a sequence would simply be a form of recursion. For a single-threaded RPC system, however, x will be busy performing RPC a and hence would be unresponsive, creating a deadlock. Alternatively, x would need to somehow save the information associated with sending RPC a while it is handling this new incoming request. This is the bookkeeping problem aluded to above.
Yet a third option is known as "event dispatch" and is typical of windowing systems, in which each action by the user (mouse motion or clicks, keyboard entries) results in delivery of an "event" record to a central dispatching loop. The application program typically registers a set of procedure callbacks to perform when events of interest are received: if the left mouse button is pressed, invoke left_button(). Arguments to these callbacks tell the program exactly what occured: the cursor was at position 132,541 when the mouse button was pressed, this is inside such and such a window, etc. One can use the same approach to handle event dispatch in message-based systems: incoming messages are treated as "events" and result in callbacks to handler procedures.
The approaches can also be combined: event dispatch systems can, for example, fork a new thread for each incoming message. In the most general approach, the callback is registered with some indication of how it should be performed: by forking a thread, by direct procedure call, or perhaps even by some other method, such as enqueuing the event on an event queue. This last approach is used in the Horus system, which we will discuss in Chapter 18.
At the time of this writing, although this is not universally the case, many RPC systems are built directly over a lightweight threads package. Each incoming RPC is handled by a new thread, eliminating the risk of deadlock, but forcing the programmer to learn about lightweight threads, preemption, mutual exclusion mechanisms, and other issues associated with concurrency. In this text, we will present some protocols that in which processes are assumed to be multi-threaded, so that the initiator of a protocol can also be a participant in it. However, we will not explicitly discuss thread packages or make use of any special features of particular packages.
The use of threads in this manner remains debatable. UNIX programs have heavily favored this approach, and the UNIX community generally understands the issues that must be addressed and minimizes their difficulty. Indeed, with experience, threaded programming is not all that hard. One merely needs to get in the habit of enforcing necessary synchronization using appropriate interlocks. However, the PC community tends to work with an event-based model that lacks threads, in which the application is visualized as a dispatcher for incoming events and all callbacks are by procedure invocation. Thus, the PC community has its own style of programming, and it is largely non-threaded. Windows NT further complicates this picture: it supports threads, and yet uses an event-oriented style of displatching throughout the operating system; if a user wants to create a thread to handle an event, this is easily done but not "forced" upon the programmer.
The discussion up to this point has focused on client/server computing and RPC from the perspective of the user. A remote procedure call protocol is concerned with the actual mechanism by which the client process issues a request to a server, and by which the reply is transmitted back from the server to the client. We now look at this protocol in more detail.
Abstractly, the remote procedure call problem, which an RPC protocol undertakes to solve, consists of emulating LPC using message passing. LPC has a number of "properties" – a single procedure invocation results in exactly one execution of the procedure body, the result returned is reliably delivered to the invoker, and exceptions are raised if (and only if) an error occurs.
Given a completely reliable communication environment, which never loses, duplicates, or reorders messages, and given client and server processes that never fail, RPC would be trivial to solve. The sender would merely package the invocation into one or more messages, and transmit these to the server. The server would unpack the data into local variables, perform the desired operation, and send back the result (or an indication of any exception that occurred) in a reply message. The challenge, then, is created by failures.
Were it not for the possibility of process and machine crashes, an RPC protocol capable of overcoming limited levels of message loss, disorder and even duplication would be easy to develop (Figure 4-4). For each process to which it issues requests, a client process maintains a message sequence number. Each message transmitted carries a unique sequence number, and (in most RPC protocols) a time stamp from a global clock – one that returns roughly the same value throughout the network, up to clock synchronization limits. This information can be used by the server to detect very old or duplicate copies of messages, which are discarded, and to identify received messages using what are called acknowledgment protocol-messages.
The basic idea, then, is that the client process transmits its request and, until acknowledgments have been received, continues to retransmit the same messages periodically. The server collects messages and, when the full request has been received, performs the appropriate procedure invocation. When it transmits its reply, the same sort of reliable communication protocol is used. Often, the acknowledgement is delayed briefly in the hope that the reply will be sent soon, and can be used in place of a separate acknowledgement.
A number of important optimizations have been proposed by developers of RPC-oriented distributed computing environments. For example, if one request will require the transmission of multiple messages, because the request is large, it is common to inhibit the sending of acknowledgments during the transmission of the burst of messages. In this case, a negative acknowledgement is sent if the receiver detects a missing packet; a single ack confirms reception of the entire burst when all packets have been successfully received (Figure 4-5). Similarly, it is common to delay the transmission of acknowledgment packets in the hope that the reply message itself can be transmitted instead of an acknowledgment: obviously, the receipt of a reply implies that the corresponding request was delivered and executed.
Process and machine failures, unfortunately, render this very simple approach inadequate. The essential problem is that because communication is over unreliable networking technologies, when a process is unable to communicate with some other process, there is no way to determine whether the problem is a network failure, a machine failure, or both (if a process fails but the machine remains operational the operating system will often provide some status information, permitting this one case to be accurately sensed).
When an RPC protocol fails by timing out, but the client or server (or both) remains operational, it is impossible to know what has occurred. Perhaps the request was never received, perhaps it was received and executed but the reply was lost, and perhaps the client or server crashed while the protocol was executing. This creates a substantial challenge for the application programmer who wishes to build an application that will operate reliably despite failures of some of the services upon which it depends.
A related problem concerns the issue of what are called exactly once semantics. When a programmer employs LPC, the invoked procedure will be executed exactly once for each invocation. In the case of RPC, however, it is not evident that this problem can be solved. Consider a process c that issues an RPC to a service offered by process s. Depending upon the assumptions we make, it may be very difficult even to guarantee that s performs this request at most once. (Obviously, the possibility of a failure precludes a solution in which s would perform the operation exactly once).
To understand the origin of the problem, consider the possible behaviors of an arbitrary communication network. Messages can be lost in transmission, and as we have seen this can prevent process c from accurately detecting failures of process s. But, the network might also misbehave by delivering a message after an unreasonably long delay. For example, suppose that a network router device fails by jamming up in such a manner that until the device is serviced, the software within it will simply wait for the hardware to be fixed. Obviously, there is no reason to simply assume that routers won’t behave this way, and in fact it is known that some routers definitely could behave this way. Moreover, one can imagine a type of attack upon a network in which an intruder records messages for future replay.
One could thus imagine a situation in which process s performs a request from c, but then is presented with the same request after a very long delay (Figure 4-6). How can process s recognize this as a duplicate of the earlier request?
Depending upon the specific protocol used, an RPC package can use a variety of barriers to protect itself against replays of long-delayed messages. For example, the package might check timestamps in the incoming messages, rejecting any that are very old. Such an approach, however, presumes that clocks are synchronized to a reasonable degree and that there is no danger that a message will be replayed with a modified timestamp – an action that might be well within the capabilities of a sophisticated intruder. The server could use a connect-based binding to its clients, but this merely pushes the same problem into the software used to implement network connections – and as we shall see shortly, the same issues arise and remain just as intractable at that level of a system. The server might maintain a list of currently valid users, and could insist that each message be identified by a monotonically increasing sequence number – but a replay could, at least theoretically, reexecute the original binding protocol.
Analyses such as these lead us to two possible conclusions. One view of the matter is that an RPC protocol should take reasonable precautions against replay but not be designed to protect against extreme situations such as replay attacks. In this approach, an RPC protocol might claim to guarantee at most once semantics, meaning that provided that the clock synchronization protocol has not been compromised or some sort of active attack been mounted upon the system, each operation will result in either a single procedure invocation or, if a communication or process failure occurs, in no invocation. An RPC protocol can similarly guarantee at least once semantics, meaning that if the client system remains operational indefinitely, the operation will be performed at least once but perhaps more than once. Notice that both types of semantics come with caveats: conditions (hopefully very unlikely ones) under which the property would still not be guaranteed. In practice, most RPC environments guarantee a weak form of at most once semantics: only a mixture of an extended network outage and a clock failure could cause such systems to deliver a message twice, and this is not a very likely problem.
A different approach, also reasonable, is to assume a very adversarial environment and protect the server against outright attacks that could attempt to manipulate the clock, modify messages, and otherwise interfere with the system. Security architectures for RPC applications commonly start with this sort of extreme position, although it is also common to weaken the degree of protection to obtain some performance benefits within less hostile subsets of the overall computing system. We will return to this issue and discuss it in some detail in Chapter 19.
The uncertainty associated with RPC failure notification and the weak RPC invocation semantics seen on some system pose a challenge to the developer of a reliable distributed application.
A reliable application would typically need multiple sources of critical services, so that if one server is unresponsive or faulty the application can re-issue its requests to another server. If the server behaves as a "read only" information source, this may be an easy problem to solve. However, as soon as the server is asked to deal with dynamically changing information, even if the changes are infrequent compared to the rate of queries, a number of difficult consistency and fault-tolerance issues arise. Even questions as simple as load balancing, so that each server in a service spanning multiple machines will do a roughly equal share of the request processing load, can be very difficult to solve or reason about.
For example, suppose that an application will use a primary-backup style of fault-tolerance, and the requests performed by the server affect its state. The basic idea is that an application should connect itself to the primary, obtaining services from that process as long as it is operational. If the primary fails, the application will "fail over" to the backup. Such a configuration of processes is illustrated in Figure 4-7. Notice that the figure includes multiple client processes, since such a service might well be used by many client applications at the same time.
Consider now the design of a protocol by which the client can issue an RPC to the primary-backup pair such that if the primary performs the operation, the backup learns of the associated state change. In principle, this may seem simple: the client would RPC to the server, which would compute the response and then RPC to the backup, sending it the request it performed, the associated state change, and the reply being returned to the client. Then the primary would return the reply, as show in Figure 4-8.
This simple protocol is, however, easily seen to be flawed if the sorts of problems we discussed in the previous section might occur while it was running [BG95]. Take the issue of timeout. In this solution, two RPC’s occur, one nested within the other. Either of these, or both, could fail by timeout, in which case there is no way to know with certainty what state the system was left in. If, for example, the client sees a timeout failure, there are quite a few possible explanations: the request may have been lost, the reply may have been lost, and either the primary or the primary and the backup may have crashed. Failover to the backup would only be appropriate if the primary is indeed faulty, but there is no accurate way to determine if this is the case, except by waiting for the primary to recover from the failure – not a very "available" approach.
The matter is further complicated by the presence of more than one client. One could easily imagine that different clients could observe different and completely uncorrelated outcomes for requests issued simultaneously but during a period of transient network or computer failures. Thus, one client might see a request performed successfully by the primary, another might conclude that the primary is apparently faulty and try to communicate with the backup, and yet a third may have timed out both on the primary and the backup! We use the term inconsistent in conjunction with this sort of uncoordinated and potentially incorrect behavior. An RPC system clearly is not able to guarantee the consistency of the environment, at least when the sorts of protocols discussed above are employed, and hence reliable programming with RPC is limited to very simple applications.
The line between "easily" solved RPC applications and very difficult ones is not a very clear one. For example, one major type of file server accessible over the network is accessed by an RPC protocol with very weak semantics, which can be visible to users. Yet this protocol, called the NFS (Network File System) protocol, is widely popular and has the status of a standard, because it is easy to implement and widely available on most vendor computing systems. NFS is discussed in some detail in Section 7.3 and so we will be very brief here.
One example of a way in which NFS behavior reflects an underlying RPC issues arises when creating a file. NFS documentation specifies that the file creation operation should return the error code EEXISTS if a file already exists at the time the create operation is issued. However, there is also a case in which NFS can return error EEXISTS even though the file did not exist when the create was issued. This occurs when the create RPC times out, but the request was in fact delivered to the server and was performed successfully. NFS automatically reissues requests that fail by timing out and will retry the create operation, which now attempts to re-execute the request and fails because the file is now present. In effect, NFS is unable to ensure at most once execution of the request, and hence can give an incorrect return code. Were NFS implemented using LPC (as in the "LFS" or "local file system"), this behavior would not be possible.
NFS illustrates one approach to dealing with inconsistent behavior in an RPC system. By weakening the semantics presented to the user or application program, NFS is able to provide acceptable behavior despite RPC semantics that create considerable uncertainty when an error is reported. In effect, the erroneous behavior is simply redefined to be a "feature" of the protocol.
A second broad approach that will interest us here involves the use of agreement protocols by which the components of a distributed system maintain consensus on the status (operational or failed) of one another. A rigorous derivation of the obligations upon such consensus protocols, the limitations on this approach, and the efficient implementation of solutions will be topics of chapters later in this textbook (Section 13.3). Briefly, however, the idea is that any majority of the system can be empowered to "vote" that a minority (often, just a single component) be excluded on the basis of apparently faulty behavior. Such a component is cut off from the majority group: if it is not really faulty, or if the failure is a transient condition that corrects itself, the component will be prevented from interacting with the majority system processes, and will eventually detect that it has been dropped. It can then execute a rejoin protocol, if desired, after which it will be allowed back into the system.
With this approach, failure becomes an abstract event – true failures can trigger this type of event, but because the system membership is a self-maintained property of the system, the inability to accurately detect failures need not be reflected through inconsistent behavior. Instead, a conservative detection scheme can be used, which will always detect true failures while making errors infrequently, in a sense we will make precise in Section 13.9.
By connecting an RPC protocol to a group membership protocol that runs such a failure consensus algorithm, a system can resolve one important aspect of the RPC error reporting problems discussed above. The RPC system will still be unable to accurately detect failures, hence it will be at risk of incorrectly reporting operational components as having failed. However, the behavior will now be consistent throughout the system: if component a observes the failure of component b, than component c will also observe the failure of b, unless c is also determined to be faulty. In some sense, this approach eliminates the notion of failure entirely, replacing it with an event that might be called "exclusion from membership in the system." Indeed, in the case where b is actually experiencing a transient problem, the resulting execution is much like being exiled from one’s country, or like being shunned: b is prevented from communicating with other members of the system and learns this. Conversely, the notion of a majority allows the operational part of the system to initiate actions on behalf of the full membership in the system. "The system" now becomes identified with a rigorous concept: the output of the system membership protocol, which can itself be defined formally and reasoned about using formal tools.
As we move beyond RPC to consider more complex distributed programming paradigms, we will see that this sort of consistency is often required in non-trivial distributed applications. Indeed, there appears to be a dividing line between the distributed applications that give non-trivial coordinated behavior at multiple locations, and those that operate as completely decoupled interacting components, with purely local correctness criteria. The former type of system requires the type of consistency we have encountered in this simple case of RPC error reporting. The latter type of system can manage with error detection based upon timeouts – but is potentially unsuitable for supporting any form of consistent behavior.
A tremendous amount has been written about client-server computing, and several pages of references could easily have been included here. Good introductions into the literature, including more detailed discussions of DCE and ASN.1, can be found in [BN84, Tan88, CS93, CDK94]. On RPC performance, the "classic" reference is [SB89]. Critiques of the RPC paradigm appear in [TR88, BR94]. On the problem of inconsistent failure detection with RPC: [BG95]. Other relevant publications include [BCLF94, BCLF95, BD95, BKT90, BM90, BN84, Bro94, EBBV95, EKO95, GA91, HP94, Jac88, Jac90, MRTR90, Ras86, SB89, TL93]. A good reference to DCE is [DCE94] and to OLE-2 is [Bro94]. Kerberos is discussed in [SNS88, BM90, Sch94].