Failure detection is a fundamental paradigm in the design of fault-tolerant distributed systems. Our prior work on failure detection (Chandra and Toueg JACM96) considered only a simple computing model, in which a process that crashes is not allowed to recover, there are no communication failures, and the network does not partition. We recently extended this work to more general models, in which the following is allowed: (1) processes that crash may eventually recover and resume execution, (2) communication links may fail by intermittently losing messages, and (3) communication links may crash and cause network partitions. These models are more realistic than the one in (CT96), and this significantly extends the applicability of our work to practical systems.
We also considered systems with probabilistic behaviors, i.e., systems where message delays and message losses follow some probability distributions, and studied the Quality of Service (QoS) of failure detectors for such systems. By QoS, we mean a specification that quantifies (a) how fast the failure detector detects actual failures, and (b) how well it avoids false detections. We gave a new failure detector algorithm and showed that, among a large class of failure detectors, it is optimal with respect to several QoS metrics. Given a set of failure detector QoS requirements, we showed how to compute the parameters of our algorithm so that it satisfies these requirements. Simulation results show that the new failure detector algorithm provides a better QoS than an algorithm that is commonly used in practice. A more detailed description of our recent research papers is given in: http://www.cs.cornell.edu/home/sam/FDpapers.html
[On leave 1999-2000.]
“Failure detection and randomization: a hybrid approach to solve Consensus.” SIAM Journal on Computing 28 (3) (June 1999), 890-903 (with M.K. Aguilera).
“Using the Heartbeat failure detector for quiescent reliable communication and consensus in partitionable networks.” Invited paper in Theoretical Computer Science, special issue on Distributed Algorithms 220 (1) (June 1999), 3-30 (with M.K. Aguilera and W. Chen).
A simple bivalency proof that t-resilient consensus requires t+1 Rounds. Information Processing Letters 71 (3-4) (August 1999), 155-158 (with M.K. Aguilera).
Failure detection and consensus in the crash-recovery model. Distributed Computing 13 (2) (April 2000), 99-125 (with M.K. Aguilera and W. Chen).
On quiescent reliable communication. SIAM Journal on Computing 29 (6) (April 2000), 2040-2073 (with M. K. Aguilera and W. Chen).
Time and space lower bounds for nonblocking implementations. To appear in SIAM Journal on Computing 30 (2), 438-456 (with P. Jayanti and K. Tang).
Revisiting the weakest failure detector for uniform reliable broadcast. Proceedings of the 13th International Symposium on Distributed Computing, Bratislava, Slovak Republic (September 1999), Lecture Notes in Computer Science, Springer-Verlag, vol. 1693, 19-33 (with M.K. Aguilera and B. Deianov).
On the quality of service of failure detectors. Proceedings of the First International Conference on Dependable Systems and Networks (also FTCS-30), IEEE Computer Society and IFIP WG 10.4, New York, NY (June 2000), 191-200 (with M.K. Aguilera and W. Chen).
Failure detector service for dependable computing. Fast abstract in Proceedings of the First International Conference on Dependable Systems and Networks (also FTCS-30), IEEE Computer Society and IFIP WG 10.4, New York, NY (June 2000), B14-B15 (with B. Deianov).
Revisiting safety and liveness in the context of failures. Proceedings of the 11th International Conference on Concurrency Theory, State College, Pennsylvania (August 2000) (with B. Charron-Bost).