
Optimal Failure Detector 
is an optimal
implementation of a multinode failure detector,
where the detector is tasked with monitoring a set of N other nodes.
It has two modes of operation.
In latencyminimizing mode, it takes as input an amount of bandwidth to spend on failure detection, and pings the monitored nodes at just the right frequency such that it will detect failures as early as possible without exceeding the given bandwidth cap.
In bandwidthminimizing mode, it takes as input a target failure detection latency, and pings the monitored nodes at just the right frequency such that it will detect failures within the targeted amount of detection latency using the minimal required amount of bandwidth.
The novelty of our failure detector lies in its use of an analytically
derived equation for determing the optimal pinging schedule to achieve
the desired performance goal.
is particularly suitable for systems where a node is tasked with monitoring the liveness status of a
large number of other nodes, such as peertopeer systems. While the derivation of the optimal schedule
is lengthy, the endresult is a closed form formula that is cheap to
evaluate. This yields a practical, simple implementation. We provide
an opensource implementation in Java and Python.
