I am a Principal Research Scientist in the Department of Computer Science at Cornell University in Ithaca, NY, and a member of the Systems and Networking group. I'm interested in distributed systems, particularly their fault tolerance and scalability aspects. I'm an associate editor for ACM Computing Surveys and an elected member of the Steering Committee of NSF PRObE (Parallel Reconfigurable Observational Environment). I also play guitar and do sound for the Ageless Jazz Band, play traditional jazz banjo in The JazzHappensBand, am co-founder of and advisor to The Cornell Dutch Club and The Cornell Ukulele Club, and am co-founder of The Finger Lakes One Wheelers, a unicycling club. I'm the webmaster of The Ithaca High School PSTA web site, and the designer and webmaster of GigKeeper.com, a site that helps with the management side of playing in bands.
My Current Research
In datacenters, services are replicated for continuous availability. However, doing so poses a variety of problems. How do you present a single consistent service to clients while providing good efficiency and performance? How do you upgrade the software of such systems on-the-fly? How do you migrate replicas, or integrate new replicas? How do you support geographically distributed replicas and provide service even in the face of network partitioning? Also, in the face of having many services, how do you deal with errors that go well beyond simple crash failures? Traditional (and popular) solutions such as primary-backup, chain replication, and consensus-based state machine replication protocols fall short in one or more of these areas.
We are developing a new approach to replication protocols that carefully separates configuration management from the ordering of updates to replicas and leverages the fact that scaling mandates aggressive partitioning. We call the various partitions "segments." The configuration of a segment is handled by some other segment, called its "manager." To make sure that each segment has a manager, we organize segments into rings (so-called "elastic bands"), where each segment is managing its successor segment on the ring. We have protocols for inserting and deleting segments from a ring. Once configuration management is separated from ordering updates, the protocols for the latter can be quite simple. In case of a failure, we "wedge" one or more of the surviving replicas to stop further updates, reconfigure, and restart. When integrating a new replica into a configuration, we have the new replica page in the necessary state in an on-demand fashion in order to minimize the reconfiguration latency. Using these mechanisms we can answer most of the questions above satisfactorily, with the exception of geographical distribution for which we're still looking for a good solution.
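The separation of configuration management from update ordering can be sketched in a few lines of code. This is a hypothetical toy model, not the actual elastic replication protocol: the class and method names are invented, and the ordering protocol within a segment is elided. It shows segments arranged in a ring so that each has a manager, and the wedge/reconfigure/restart sequence on failure.

```python
# Toy model of elastic replication (hypothetical names, not the real API).

class Replica:
    def __init__(self, rid):
        self.rid = rid
        self.wedged = False   # a wedged replica refuses further updates
        self.log = []

    def apply(self, update):
        if self.wedged:
            raise RuntimeError("replica is wedged")
        self.log.append(update)

class Segment:
    """A partition of the service state, replicated for fault tolerance."""
    def __init__(self, name, replicas):
        self.name = name
        self.replicas = list(replicas)

    def update(self, op):
        for r in self.replicas:   # ordering protocol elided
            r.apply(op)

    def reconfigure(self, failed):
        # Invoked by this segment's manager: wedge the survivors to stop
        # further updates, install the new configuration, then restart.
        survivors = [r for r in self.replicas if r is not failed]
        for r in survivors:
            r.wedged = True        # wedge: no more updates accepted
        self.replicas = survivors  # new configuration without the failed replica
        for r in self.replicas:
            r.wedged = False       # restart under the new configuration

class Ring:
    """An "elastic band": segment i is managed by segment i-1 on the ring."""
    def __init__(self, segments):
        self.segments = segments

    def manager_of(self, i):
        return self.segments[(i - 1) % len(self.segments)]
```

In this sketch the manager relation is purely positional, so every segment has exactly one manager as long as the ring is non-empty; the real protocols for inserting and deleting segments maintain that invariant.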
We have finished an implementation of Ziryab, a key/value store based on elastic replication, and have started a performance evaluation using standard benchmarks to compare Ziryab with other key/value stores. Interesting metrics include latency and throughput under load, as well as availability under various failure scenarios. Ziryab also supports serializable ACID consistency by providing a timestamp-based transaction facility, which allows multiple get/put operations to be grouped and executed atomically even when the keys involved reside in different segments. There are standard benchmarks for transactional databases as well, and we plan to run these against Ziryab and various well-known databases for comparison, again in the face of various failure scenarios.
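To illustrate how timestamp-based transactions can group get/put operations across segments, here is a deliberately simplified sketch using optimistic validation. It is a hypothetical illustration, far simpler than Ziryab's actual protocol: all names are invented, and a single in-process clock stands in for distributed timestamp management.

```python
# Hypothetical sketch of timestamp-based transactions over a segmented
# key/value store; Ziryab's real protocol is distributed and more involved.
import itertools

class SegmentedStore:
    def __init__(self, nsegments=4):
        # Each segment maps key -> (commit timestamp, value).
        self.segments = [dict() for _ in range(nsegments)]
        self.clock = itertools.count(1)  # stand-in for a timestamp service

    def _seg(self, key):
        return self.segments[hash(key) % len(self.segments)]

class Transaction:
    def __init__(self, store):
        self.store = store
        self.reads = {}    # key -> timestamp observed at read time
        self.writes = {}   # buffered puts, applied only on commit

    def get(self, key):
        if key in self.writes:            # read your own writes
            return self.writes[key]
        ts, value = self.store._seg(key).get(key, (0, None))
        self.reads[key] = ts
        return value

    def put(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Validate: abort if any key we read has been overwritten since.
        for key, ts in self.reads.items():
            if self.store._seg(key).get(key, (0, None))[0] != ts:
                return False
        ts = next(self.store.clock)
        for key, value in self.writes.items():  # apply all writes atomically
            self.store._seg(key)[key] = (ts, value)
        return True
```

Because every write in a transaction commits under one timestamp, a group of puts spanning several segments becomes visible atomically, which is the property the benchmarks exercise under failure scenarios.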
This is joint work with Hussam Abu-Libdeh, Haoyan Geng, and David Goff.
Fault-tolerant TCP for a stable BGP
The issue here is that if a BGP (Internet routing) daemon dies for some reason, its peers see their TCP connections break. They will incorrectly conclude that the physical link is lost, and distribute routing table updates that lead to disruptions such as loops and black holes. While many BGP daemons can negotiate something called "Graceful Restart" that suppresses such updates if a TCP connection can quickly be re-established, the recovery involves synchronizing hundreds of thousands of table entries for each peer, which can take tens of minutes during which routing updates do not propagate through the router. Using standard hot-standby replication techniques it is possible to recover a BGP daemon in seconds, but because of the broken TCP connections recovery is still measured in tens of minutes.
We are developing TCPR, a NAT-like person-in-the-middle solution that modifies TCP packets in order to fix the issue. TCPR supports rapid recovery and also connection migration. TCPR, surprisingly, does not need to maintain any hard state, although it does rely on some state kept by the BGP daemon itself and the TCP stacks of its peers. The three most important "tricks" used by TCPR are the following. First, TCPR suppresses FIN packets that would signal to the peers that the BGP daemon has died. Second, after a new BGP daemon starts with fresh TCP stacks for each peer, TCPR rewrites TCP sequence numbers so that the peers believe the new TCP stacks are synchronized with their own. Finally, TCPR delays acknowledgments to the peers until received updates have been successfully processed, so that no incoming data can get lost.
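The three tricks can be modeled in a few lines. This is a toy model under stated assumptions, not the real TCPR, which rewrites raw packets in a middlebox: packets are plain dicts, and the class and field names are invented for illustration.

```python
# Toy model of TCPR's packet rewriting (hypothetical names; the real
# TCPR operates on raw TCP segments as a NAT-like middlebox).

class TcprConnection:
    def __init__(self):
        self.delta = 0          # new daemon's ISN minus the original ISN
        self.ack_ceiling = 0    # highest byte the application has processed

    def record_recovery(self, old_isn, new_isn):
        # A restarted daemon picks a fresh initial sequence number; remember
        # the offset so peers never notice the new TCP stack (trick 2).
        self.delta = new_isn - old_isn

    def application_processed(self, upto):
        # The daemon reports how much input it has durably processed; only
        # then may that data be acknowledged to peers (trick 3).
        self.ack_ceiling = max(self.ack_ceiling, upto)

    def outbound(self, pkt):
        """Rewrite a packet from the (possibly restarted) daemon to a peer."""
        if pkt.get("fin"):
            return None  # trick 1: suppress FIN so peers don't see a close
        pkt = dict(pkt)
        # trick 2: translate the new sequence space back to the original one
        pkt["seq"] = (pkt["seq"] - self.delta) % 2**32
        # trick 3: delay acknowledgments until the data has been processed
        pkt["ack"] = min(pkt["ack"], self.ack_ceiling)
        return pkt

    def inbound(self, pkt):
        """Rewrite a packet from a peer toward the daemon."""
        pkt = dict(pkt)
        pkt["ack"] = (pkt["ack"] + self.delta) % 2**32
        return pkt
```

Note that the only per-connection state here is an offset and a counter, both reconstructible from the daemon and the peers' TCP stacks, which is the sense in which TCPR keeps no hard state.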
Current research questions are mostly focused on performance. How many TCP connections can a single TCPR instance support? How much overhead is there in terms of latency and throughput as a function of the number of connections (BGP packets are signed, and because TCPR rewrites packets it has to recompute the HMAC on each packet)? How quickly can BGP recover with the use of TCPR, as a function of the number of peers and the number of routing table entries, compared to plain BGP daemons and daemons that support Graceful Restart? How does TCPR interact with performance options such as window scaling?
This is joint work with Robert Surton and Ken Birman, as well as partners at Cisco.
Correct-by-construction Fault-tolerant Distributed Systems
Fault-tolerant distributed systems, algorithms, and protocols are notoriously hard to build. Going from a specification to an implementation involves many subtle steps that are easy to get wrong. Moreover, fault-tolerant algorithms require many sources of diversity in order to survive a variety of failures, yet programmers, even different ones, tend to make the same mistakes when implementing code from a specification.
I'm developing techniques to derive distributed algorithms by stepwise refinement from a specification. Each step can be formally checked. Moreover, one can often make different design choices during these refinements, leading to sources of diversity that can be exploited for fault tolerance. With the NuPrl (formal methods) group at Cornell I'm working to make this process mostly automated, and we are already able to synthesize a variety of executable consensus protocols.
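The idea that different refinements of one specification yield diversity can be illustrated with a deliberately tiny example. This is a hypothetical sketch, far simpler than the NuPrl-synthesized consensus protocols: a nondeterministic specification (any decided value must have been proposed) is resolved by two different design choices, each a correct refinement.

```python
# Toy illustration of refinement and diversity (hypothetical; real
# derivations refine full consensus specifications, not a single predicate).

def spec_allows(proposals, decision):
    """Specification (validity): any decided value must have been proposed."""
    return decision in proposals

# Two refinements of the same specification. Each resolves the spec's
# nondeterminism with a different design choice, so both are correct,
# but they behave differently -- a source of diversity.
def refine_min(proposals):
    return min(proposals)        # design choice: decide the smallest proposal

def refine_first(proposals):
    return proposals[0]          # design choice: decide the first proposal
```

Deploying independently derived refinements like these side by side means a bug or attack that trips one decision rule need not trip the other, which is the fault-tolerance payoff of diversity.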
Joint work with Fred Schneider, Danny Dolev, Robert Constable, and Mark Bickford.