My recent research, in collaboration with M. Aguilera and W. Chen, is on the use of unreliable failure detectors for designing reliable distributed systems. We studied the problems of failure detection and consensus in asynchronous systems in which processes may crash and recover, and links may lose messages. We first proposed new failure detectors that are particularly well suited to the crash-recovery model. We next determined under what conditions stable storage is necessary to solve consensus in this model. Using the new failure detectors, we gave two consensus algorithms that match these conditions: one requires stable storage and the other does not. Both algorithms tolerate link failures and are particularly efficient in the runs that are most likely in practice: those with no failures or failure detector mistakes.
Director: Master of Engineering Program, Computer Science Department