Cornell Systems Lunch
CS 7490 Spring 2023
The Systems Lunch is a seminar for discussing recent, interesting papers in the systems area, broadly defined to span operating systems, distributed systems, networking, architecture, databases, and programming languages. The goal is to foster technical discussions among the Cornell systems research community. We meet once a week on Fridays at 11:30 AM in Gates 114.
The systems lunch is open to all Cornell Ph.D. students interested in systems. First-year graduate students are especially welcome. Non-Ph.D. students have to obtain permission from the instructor. Student participants are expected to sign up for CS 7490, Systems Research Seminar, for one credit.
To join the systems lunch mailing list please send an empty message to email@example.com with the subject line "join". More detailed instructions can be found here.
Links to papers and abstracts below are unlikely to work outside the Cornell CS firewall. If you have trouble viewing them, this is the likely cause.
The Zoom link is https://cornell.zoom.us/j/96859053610?pwd=TjRvck9sZklqQjFoQmdDazNuYm5Tdz09.
|January 27||BeeGees: stayin' alive in chained BFT
Abstract: Modern chained Byzantine Fault Tolerant (BFT) systems leverage a combination of pipelining and leader rotation to obtain both efficiency and fairness. These protocols, however, require a sequence of three or four consecutive honest leaders to commit operations. Therefore, even simple leader failures such as crashes can weaken liveness both theoretically and practically. Obtaining a chained BFT protocol that reaches decisions even if the sequence of honest leaders is non-consecutive remains an open question. To resolve this question, we present BeeGees, a novel chained BFT protocol that successfully commits blocks even with non-consecutive honest leaders. It does this while also maintaining quadratic word complexity with threshold signatures, linear word complexity with SNARKs, and responsiveness between consecutive honest leaders. BeeGees reduces the expected commit latency of HotStuff by a factor of three under failures, and the worst-case latency by a factor of seven.
Bio: Neil Giridharan is a third-year PhD student advised by Natacha Crooks, studying distributed systems with a focus on BFT consensus protocols.
|Neil Giridharan (Berkeley)|
|February 3||Cancelled, no meeting.|
|February 10||Carat Cake: replacing paging via compiler/kernel cooperation
Brian Suchy et al., Northwestern University
|February 17||IBM Research and Hybrid Multicloud
Dr. Andrew Anderson and Dr. Braulio Dumba, IBM Research
NOTE: this meeting will be at 2pm in Gates 114. This lecture will provide an overview of IBM Research's cutting-edge work in the areas of hybrid cloud, artificial intelligence, quantum computing, and core sciences. The discussion will highlight IBM Research's efforts to develop and integrate these technologies to solve real-world problems and shape the future of computing.
|A. Anderson and B. Dumba (IBM)|
|February 24||Cancelled, no meeting.|
|March 3||Cancelled, no meeting.|
|March 10||Safe permissionless consensus
Youer Pu, Lorenzo Alvisi, and Ittay Eyal
|March 17||Understanding and Optimizing ML Data Storage and Ingestion Systems
Recent breakthroughs in machine learning models have been powered by datacenter-scale AI training clusters consisting of thousands of accelerators. These clusters fundamentally rely on a data storage and ingestion (DSI) pipeline, consisting of numerous systems that generate, store, and preprocess massive amounts of training data. As accelerators continue to push training efficiency and throughput, DSI infrastructure is becoming the dominant factor that constrains a datacenter’s overall training performance and capacity. To continue scaling ML systems, innovations in DSI infrastructure are urgent. To this end, I will present an overview of the industry-scale DSI systems used to train the deep learning recommendation models (DLRMs) that dominate ML demand at Meta. Guided by a deep characterization of these systems, I will synthesize key takeaways and motivate further research towards optimizing DSI hardware and software. Building on these takeaways, I will then discuss RecD, a suite of infrastructure optimizations centered around deduplicating DLRM datasets. I will explore how RecD drastically improves the performance and efficiency of Meta’s end-to-end ML training pipeline, including storage, preprocessing, and training systems.
|Mark Zhao (Stanford)|
|March 24||Cancelled, no meeting.|
|March 31||Designing Exascale Distributed Systems
|Saurabh Kadekodi (Google)|
|April 7||Spring Break, no meeting.|
|April 14||Cancelled, no meeting.|
|April 21||The End of Testing? The Promise of Verification-Driven Software Engineering
Most of the effort of engineering goes into preventing and correcting bugs. Practicing engineers greatly value tools that move bug discovery earlier, from deployment to testing, or better yet from testing to compile-time type checks. Systems software verification is the application of machine-checked proofs to software engineering. It is the limiting case of eager bug discovery: once a program passes verification, it has no bugs except for those in the spec that describes its contract and its environment. The primary impediment to replacing testing with software verification is cost: Historically, proofs were so fantastically expensive that you would never even try until the program was empirically bug-free. Contemporary developments are driving these costs way down. The promise of verification-driven engineering is emerging into reality.
Bio: Jon Howell is a Principal Researcher in the VMware Research Group pursuing the applicability and deployment of verification as a practical engineering tool. He has been part of the Ironclad, IronFleet, and VeriBetrFS systems verification projects, and is presently building a verified high-performance storage system.
|Jon Howell (VMware)|
|April 28||Cancelled, no meeting.|
|May 5||Cancelled, no meeting.|
|May 12||Cancelled, no meeting.|
|May 19||Efficient Transactional Causal Consistency for Serverless Computing and its Applications to Microservice Architectures
In this talk we describe a system that augments a Function-as-a-Service (FaaS) middleware with support for Transactional Causal Consistency (TCC). We propose a novel architecture to support TCC in FaaS, named FaaSTCC, that significantly reduces the inter-worker coordination overhead. FaaSTCC achieves this goal by augmenting the workers with a caching layer and by implementing mechanisms that maximize cache usage. We show that, using these techniques, TCC can be implemented efficiently. We also discuss how we plan to extend this work to help in the process of migrating monolithic applications to microservice architectures.
Bio: Luís Rodrigues is a Professor at Instituto Superior Técnico (IST), Universidade de Lisboa, and a member of the Distributed, Parallel and Secure Systems Group at the INESC-ID research laboratory. The focus of his work is on researching and teaching algorithms for building reliable distributed systems.