Thursday, March 27, 2008
4:15 pm
B17 Upson Hall

Computer Science
Colloquium
Spring 2008

Hakim Weatherspoon
Cornell University
 

Storage Systems for Global Scale Datacenters

Digital information plays an increasingly critical role in scientific research, military systems and other enterprises, and this trend has important implications. First, many systems are increasingly distributed over a global network of datacenters, which is emerging as an important distributed systems paradigm. Second, storage systems in these environments must ensure the durability, integrity, and accessibility of digital data, and do so under potentially turbulent conditions. For example, in large-scale distributed systems, servers fail continuously; data should remain durable despite constant failure.

Antiquity is a distributed storage system designed for these sorts of challenging environments. It maintains data securely, consistently, and with high availability in a dynamic wide-area environment. At the core of the system is a novel secure log structure that permits Antiquity to guarantee the integrity of stored data, even under extreme stress. Data is replicated on multiple servers in a manner that ensures that it can be retrieved later even when some replicas are inaccessible. Moreover, unlike prior fault-tolerant systems, the Antiquity fault-tolerance protocols can handle high levels of node churn, regenerating data on the fly when necessary to handle faults ranging from server outages to Byzantine (malicious) attacks.

Further, I will present SMFS, a remote mirroring solution targeted at settings where high-speed, high-latency links connect a pair of datacenters. SMFS provides strong disaster tolerance guarantees with asynchronous performance: mirroring response times are more typical of a high-speed LAN setting. Not only does the approach provide reliability through mirroring, but there are conditions under which it offers dramatic power savings. Longer term, we see SMFS and Antiquity as two examples of a family of innovative solutions addressing a range of demanding problems seen in turbulent, mission-critical, and power-constrained settings.

Hakim Weatherspoon is currently a post-doctoral fellow at Cornell University. His work covers various aspects of information systems, distributed systems, network systems, and peer-to-peer systems, with a focus on fault-tolerance, reliability, security, and performance of Internet-scale systems. He previously received his PhD in computer science from the University of California, Berkeley.