CS Colloquium
Thursday, April 1, 2004
B17 Upson Hall

Dave Andersen

Improving the End-to-End Availability of Internet-Based Systems

The end-to-end availability of Internet services is between two and three orders of magnitude worse than that of other important engineered systems, including the US airline system, the 911 emergency response system, and the US public telephone system. This talk makes two contributions toward improving end-to-end availability. First, a study of three years of data collected on a 31-site testbed explores why failures happen, and finds that access network failures, inter-provider and wide-area routing anomalies, domain name system faults, and server-side failures all contribute to reduced availability.

Second, an overlay network with new algorithms for end-to-end path selection improves availability by one or two orders of magnitude compared to the current state. A purely overlay-based system, RON (resilient overlay networks), deploys nodes in different organizations and networks, carefully measures and monitors the status of the paths between them, and has the nodes cooperatively route packets through one another to bypass faults. A second system, MONET (multihomed overlay networks), combines physical path redundancy with overlays within a network of cooperative Web proxies to improve availability for Web users. Experimental evidence suggests that RON can reduce failures by a factor of six, and that with physical path redundancy, a six-site MONET eliminates almost all network-based failures.
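
The overlay idea described above (measure the paths between nodes, then route through another node when the direct path fails) can be illustrated with a minimal sketch. This is not the algorithm presented in the talk: the loss-rate table, the restriction to a single intermediate node, and the names `path_success` and `best_path` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical measurement table: (from_node, to_node) -> observed loss rate in [0, 1].
LossTable = Dict[Tuple[str, str], float]


def path_success(loss: LossTable, hops: List[str]) -> float:
    """Estimated delivery probability along a sequence of overlay hops.

    Links with no measurement are treated as fully lossy (an assumption).
    """
    p = 1.0
    for a, b in zip(hops, hops[1:]):
        p *= 1.0 - loss.get((a, b), 1.0)
    return p


def best_path(loss: LossTable, src: str, dst: str, nodes: List[str]) -> List[str]:
    """Pick the direct path or a one-intermediate detour, whichever looks best.

    The direct path is listed first, so it wins ties.
    """
    candidates = [[src, dst]]
    candidates += [[src, n, dst] for n in nodes if n not in (src, dst)]
    return max(candidates, key=lambda hops: path_success(loss, hops))


# Example: the direct A -> B path has failed, but both legs through C are healthy.
loss = {("A", "B"): 1.0, ("A", "C"): 0.01, ("C", "B"): 0.02}
route = best_path(loss, "A", "B", ["A", "B", "C"])  # detours through C
```

In this toy setting the detour through C is chosen because its estimated delivery probability (0.99 × 0.98) exceeds that of the failed direct path; once the direct path recovers, the same function falls back to it.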