Failure Detectors

Work on Failure Detectors at Cornell University by M. Aguilera, T. Chandra, W. Chen, B. Deianov, V. Hadzilacos and S. Toueg.
(This work is partially supported by the National Science Foundation under grants CCR-9402896 and CCR-9711403.)


An extended abstract will appear in the International Conference on Dependable Systems and Networks (ICDSN/FTCS-30), June 2000.

Abstract: We study the quality of service (QoS) of failure detectors. By QoS, we mean a specification that quantifies (a) how fast the failure detector detects actual failures, and (b) how well it avoids false detections. We first propose a set of QoS metrics to specify failure detectors for systems with probabilistic behaviors, i.e., for systems where message delays and message losses follow some probability distributions. We then give a new failure detector algorithm and analyse its QoS in terms of the proposed metrics. We show that, among a large class of failure detectors, the new algorithm is optimal with respect to some of these QoS metrics. Given a set of failure detector QoS requirements, we show how to compute the parameters of our algorithm so that it satisfies these requirements, and we show how this can be done even if the probabilistic behavior of the system is not known. We then present some simulation results that show that the new failure detector algorithm provides a better QoS than an algorithm that is commonly used in practice. Finally, we briefly explain how to make our failure detector adaptive, so that it automatically reconfigures itself when there is a change in the probabilistic behavior of the network.
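To make the two QoS aspects named in the abstract concrete, here is a minimal sketch in Python. It is not the algorithm proposed in this work. It assumes a plain heartbeat detector with a fixed timeout that restarts on every received heartbeat, which is one common design. The function names, the sample arrival times, and the two reported quantities (false suspicions and detection delay) are illustrative assumptions, not the paper's metric definitions.

```python
def suspicion_intervals(arrivals, timeout, horizon):
    """Return [start, end) intervals during which the monitor suspects the process.

    arrivals: sorted times at which heartbeats reach the monitor.
    timeout:  the monitor suspects once `timeout` elapses without a new heartbeat.
    horizon:  end of the observation window.
    """
    intervals = []
    deadline = timeout  # the timer is assumed to start at time 0
    for t in arrivals:
        if t > deadline:
            # The timer expired before this heartbeat arrived.
            intervals.append((deadline, t))
        deadline = t + timeout  # every heartbeat restarts the timer
    if deadline < horizon:
        # No further heartbeats: suspected until the end of the window.
        intervals.append((deadline, horizon))
    return intervals


def simple_qos(arrivals, timeout, crash_time, horizon):
    """Count false suspicions and measure the delay until the crash is detected."""
    intervals = suspicion_intervals(arrivals, timeout, horizon)
    permanent = intervals[-1] if intervals and intervals[-1][1] == horizon else None
    # Every suspicion that was later cancelled by a heartbeat is a mistake.
    mistakes = len(intervals) - (1 if permanent else 0)
    detection = max(0.0, permanent[0] - crash_time) if permanent else None
    return mistakes, detection


# Hypothetical trace: heartbeats sent about once per time unit, one of them lost,
# and the monitored process crashes at time 6.5.
arrivals = [1.1, 2.2, 3.1, 5.0, 6.1]
short = simple_qos(arrivals, timeout=1.5, crash_time=6.5, horizon=20.0)
long_ = simple_qos(arrivals, timeout=2.0, crash_time=6.5, horizon=20.0)
print(short, long_)
```

On this hypothetical trace, the shorter timeout detects the crash sooner but produces one false suspicion, while the longer timeout avoids the mistake at the cost of a slower detection. This is the kind of trade-off that a QoS specification for a failure detector has to quantify.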