The Systems group at Cornell examines the design and implementation of the fundamental software systems that form our computing infrastructure. Below we give just a small representation of the varied systems work going on here, and invite you to visit the project and faculty web pages, as well as read our papers.

 

Cloud Computing

The increasing popularity of cloud storage is leading organizations to consider moving computing and data out of their own data centers and into the cloud. However, success for cloud storage providers can present a significant risk to customers; namely, it becomes very expensive to switch storage providers. Hakim Weatherspoon and his students have been working on various solutions to reduce lock-in risks for cloud customers (SOCC 2010, HotCloud 2011).

Emin Gun Sirer's group has proposed an unorthodox topology for datacenters that eliminates all hierarchical switches in favor of connecting nodes at random according to a small-world-inspired distribution (SOCC 2011). Surprisingly, these Small-World Datacenters can achieve higher bandwidth and fault tolerance than conventional hierarchical datacenters.
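To make the idea of small-world-inspired wiring concrete, here is a minimal sketch in Python. It is not the construction from the SOCC 2011 paper: it assumes a simple ring of nodes in which each node also gets a few random long-range links, with a link of ring distance d chosen with probability proportional to 1/d (a Kleinberg-style choice for one dimension). The function names and parameters are hypothetical.

```python
import random
from collections import deque


def small_world_ring(n, extra_links=2, seed=0):
    """Build an undirected graph: a ring plus random long-range links.

    Assumes n >= 4. Each long-range link from node u picks a target at ring
    distance d with probability proportional to 1/d (illustrative choice).
    """
    rng = random.Random(seed)
    adj = {u: set() for u in range(n)}
    # Regular part: every node is wired to its ring neighbour.
    for u in range(n):
        v = (u + 1) % n
        adj[u].add(v)
        adj[v].add(u)
    # Random part: long-range links, biased toward shorter distances.
    distances = list(range(2, n // 2 + 1))
    weights = [1.0 / d for d in distances]
    for u in range(n):
        for _ in range(extra_links):
            d = rng.choices(distances, weights=weights, k=1)[0]
            v = (u + rng.choice((-1, 1)) * d) % n
            adj[u].add(v)
            adj[v].add(u)
    return adj


def average_hops(adj, source=0):
    """Average BFS hop count from `source` to every other reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return sum(dist.values()) / (len(dist) - 1)
```

Comparing average_hops(small_world_ring(64)) with average_hops(small_world_ring(64, extra_links=0)) shows how a handful of random links shortens paths relative to the plain ring. Bandwidth and fault tolerance, the properties claimed above, are not measured by this sketch.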

Hakim Weatherspoon and his students are looking at designing datacenter storage systems that are frugal with energy use (HotOS 2007).

Robbert van Renesse and his students are working on scalable replication techniques for the cloud. They have recently developed a technique called Elastic Replication, in which configuration management is carefully separated from the ordering of updates to replicas. Replicated objects are cooperatively responsible for each other's configuration. As a result, management can be aggressively decentralized for scalability while strong consistency is maintained.

Ken Birman has been developing a new library for building highly assured cloud computing systems that use replication and coordinate actions; he calls it Isis2 and you can download it from isis2.codeplex.com. Isis2 automates such tasks as updating replicated data, locking, moving files around and changing their replication patterns, and checkpointing and restarting from checkpoints. A recent focus has been on wide-area communications reliability and speed, and Ken is especially interested in understanding how to leverage remote DMA (RDMA) network hardware, which can move big blocks of data at raw network speeds. A major application area for his group is the smart power grid, which needs to capture data from the power grid at cloud scale and then analyze it so that power delivery can be matched to demand, renewables can be integrated with more traditional energy sources, and the grid can be protected against mishaps or attacks.

For summaries of additional cloud computing research at Cornell see cloudcomputing.cornell.edu.

 

Distributed Systems and Fault Tolerance

Cornell is particularly well known for its foundational and practical work on fault-tolerant distributed systems. Ken Birman's book on reliable distributed systems is widely used in classrooms and industry (a new edition will be published early in 2012). His Isis toolkit system (SOSP 1985, SOSP 1987) was used extensively in industry for building fault-tolerant systems for decades, and the new version, Isis2, aims at scalable reliability for cloud computing systems (a paper on this work recently appeared at TRIOS 2013); other such systems created by Birman and Van Renesse in the past decade include Ensemble (SOSP 1999).

Fred Schneider's oft-referenced and ACM Hall of Fame award-winning State Machine Replication tutorial is standard fare in systems courses around the world (ACM Computing Surveys 1990). Van Renesse and Schneider invented and analyzed the Chain Replication paradigm (OSDI 2004), which is now used by several large Internet services. Birman, Schneider, and Van Renesse were all major contributors to the 2010 Springer book "Replication: Theory and Practice", a comprehensive book on replication techniques. Van Renesse and Schneider are currently investigating building robust distributed systems based on stepwise refinement.

Andrew Myers and his students are working on Fabric, a federated, distributed system for securely and reliably storing, sharing, and computing information (SOSP 2009). Fabric presents a single-system image of all resources that can be named by it, and provides security guarantees to the mutually distrusting principals using it, yet it is a decentralized system with no centralized security enforcement mechanism.

Van Renesse and his students developed Nysiad (NSDI 2008), a system that implements a new technique for transforming a scalable distributed system or network protocol tolerant only of crash failures into one that tolerates arbitrary failures, including freeloading and malicious attacks. Van Renesse is currently working with Robert Constable of the NuPrl group on automatically synthesizing fault-tolerant algorithms such as consensus. A recent paper at SOSP 2013 describes work that emerged from a collaboration between Facebook and Birman and Van Renesse's group at Cornell.

 

Networking

Emin Gun Sirer and his students have proposed a technique to create more expressive, futuristic networks that enable users to query them about their properties (SIGCOMM 2011). Called NetQuery, this technique leverages emerging secure coprocessors to establish ground truths about network elements, and facilitates automated reasoning about whether a network exhibits a particular characteristic (e.g., capacity, low loss rate, or availability of redundant paths). In separate work, Sirer's group proposed a new way of distributing multimedia content using a hybrid peer-to-peer architecture (SIGCOMM 2011). This scheme combines the advantages of client-server systems with the high capacity of peer-to-peer systems, and improves on both of these architectures by quantitatively evaluating the marginal benefit of available bandwidth to competing consumers, enabling efficient utilization of upload bandwidth.

Nate Foster, in cooperation with researchers at Princeton, has been developing high-level languages, such as Frenetic, for programming distributed collections of enterprise network switches (ICFP 2011, HotNets 2011, POPL 2012). The languages allow modular reasoning about network properties.

Ken Birman, Robbert van Renesse, Hakim Weatherspoon, and their students are working with researchers from academia and industry on the Nebula project, whose goal is to address threats to the cloud while meeting the challenges of flexibility, extensibility and economic viability (IEEE Internet Computing 2011). One artifact that came out of this work is TCPR, a tool that fault-tolerant applications can use to recover their TCP connections after crashing or migrating; it masks the application failure and enables transparent recovery, so the remote peers remain unaware. Another artifact under development is SoNIC (Software-defined Network Interface Card), which provides precise and reproducible measurements of an optical lambda network. By achieving extremely high levels of precision, SoNIC can shed light on the complexities of flows that traverse high-speed networks (IMC 2010, DSN 2010).

Andrew Myers and Emin Gun Sirer developed Trickles (NSDI 2005, ACM TOCS 2008), a novel TCP-like transport protocol and a new interface to replace sockets that together enable all state to be kept on one endpoint, allowing the other endpoint, typically the server, to operate without any per-connection state.

Birman and Van Renesse collaborated with Cisco to create a high-availability option for the Cisco CRS-1 backbone network routers (DSN 2013).
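The idea behind Trickles, a server that keeps no per-connection state, can be illustrated with a small Python sketch. This is not the Trickles protocol or its wire format; it only assumes one common way to get the same effect: the server hands its state to the client in a token protected by a message authentication code, and the client returns that token with the next request. All names here (SERVER_KEY, handle_request, download) are hypothetical.

```python
import hashlib
import hmac
import json

SERVER_KEY = b"demo-key-known-only-to-the-server"  # hypothetical secret


def _seal(state):
    """Serialize the state and attach a MAC so the client cannot alter it."""
    blob = json.dumps(state, sort_keys=True).encode()
    tag = hmac.new(SERVER_KEY, blob, hashlib.sha256).hexdigest()
    return {"blob": blob.decode(), "tag": tag}


def _unseal(token):
    """Verify the MAC on a returned token and recover the state."""
    blob = token["blob"].encode()
    expected = hmac.new(SERVER_KEY, blob, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, token["tag"]):
        raise ValueError("state token failed verification")
    return json.loads(blob)


def handle_request(token, data, chunk_size=4):
    """Serve the next chunk using only the state carried in `token`.

    The server remembers nothing between calls; a token of None starts
    a new transfer at offset 0.
    """
    state = {"offset": 0} if token is None else _unseal(token)
    start = state["offset"]
    chunk = data[start:start + chunk_size]
    return chunk, _seal({"offset": start + len(chunk)})


def download(data):
    """Client loop: echo the latest token back until the data is exhausted."""
    token, received = None, ""
    while True:
        chunk, token = handle_request(token, data)
        if not chunk:
            return received
        received += chunk
```

Because every request carries everything the server needs, any replica holding the same key could answer it, which is the property that makes stateless servers attractive for failover and load balancing. A real transport protocol would also have to handle loss, reordering, and replayed tokens, which this sketch ignores.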


Operating Systems

Cornell is one of the few institutions worldwide where brand-new research operating systems are being actively developed. Sirer and Schneider have recently built the Nexus (SOSP 2011, TOISS 2011, SIGCOMM 2011, OSDI 2008), a new operating system that enables trustworthy computing. The Nexus enables a new class of applications that can provide unprecedented confidentiality and integrity guarantees; as a demonstration, they have built a social networking application, called Fauxbook, whose users can be assured that their information cannot be leaked, even with Fauxbook developers acting as adversaries. Hakim Weatherspoon is currently working on multi-core extensions to the Linux operating system, as well as file system mirroring across high-bandwidth, high-latency links (FAST 2009). Fred Schneider built a replicated UNIX system using virtual machine technology (SOSP 1995, ACM TOCS 1996). Emin Gun Sirer developed a new architecture for networked virtual machines (SOSP 1999), and was the main kernel designer for the SPIN extensible operating system (SOSP 1995).

Peer-to-Peer Systems

Cornell faculty have done extensive work in the peer-to-peer (P2P) networking area, ranging from file sharing to media streaming to network monitoring. Emin Gun Sirer and his students have created a large number of P2P systems, including Blindfold (IPTPS 2010), a scheme to ensure that the operators of content aggregators are completely blind to the content that they are storing and serving, thereby eliminating the possibility of censoring content at the servers; Antfarm (NSDI 2009), a content distribution system based on managed swarms; Octant (NSDI 2007), a system for geolocation of Internet hosts; and Beehive, a peer-to-peer replication system on which a new DNS, an Internet-scale publish-subscribe system and a new content distribution network were built (NSDI 2006, SIGCOMM 2004). Sirer's Credence system (NSDI 2005) for determining authentic content on peer-to-peer systems was deployed on the Gnutella network and led to a large-scale user study. The Karma system explored peer-to-peer currencies long before Bitcoin. Van Renesse developed Fireflies, a Byzantine-tolerant P2P overlay network (EuroSys 2006). Van Renesse and Birman developed the highly scalable Astrolabe network monitoring system (IPTPS 2002, ACM TOCS 2003), now used at a major e-retailer. Birman's Kelips system was the first one-hop DHT, and his Bimodal Multicast protocol was unusual in using P2P communication as a tool in a reliable multicast protocol. Weatherspoon designed and implemented the Antiquity system, a secure P2P storage facility (EuroSys 2007).

Cross-Cutting Research Areas

Besides the topics mentioned above, the systems faculty is also actively involved with cross-cutting research involving Security, Programming Languages, Computer Architecture, and Theory. One example of cross-cutting research is the Meridian project (SIGCOMM 2005), where Sirer and his group applied some of the insights from Jon Kleinberg's theoretical work on small-world networks to peer-to-peer systems. Meridian and its successor, Cubit, can route queries to their destinations efficiently over a lightweight overlay, providing new search functionality not possible with preceding peer-to-peer systems. The Cubit plugin for Vuze was downloaded by more than 30,000 people. Another example of cross-cutting research is the work Birman's group has done on Web 2.0 collaboration. This effort looked at the challenges of using Web 2.0 technologies (mashups) in support of demanding collaboration applications, such as one sees in the military or in hospitals. As part of this effort, they built a mashup technology of their own, Live Objects (ECOOP 2008, Middleware 2008). Have a look at the demo.

Environment

The Systems Group at Cornell prides itself on its collegial internal environment. Our Systems Lunches, where professors and graduate students get together every Friday to have lunch and discuss recent, cutting-edge papers in the field, draw an attendance of 40–60 people, and the format has been adopted by many other institutions. Cornell's Systems Lab, a large collaborative space with wall-to-wall whiteboards, projectors, sound systems and work areas for up to three dozen people, has served as a crucible where people hack together on projects and design new systems.