Cornell Systems Lunch

CS 7490 Fall 2024
Friday 11:45 AM, Gates 114 and Bloomberg 201

Alex Conway and Hakim Weatherspoon


The Systems Lunch is a seminar for discussing recent, interesting papers in the systems area, broadly defined to span operating systems, distributed systems, networking, architecture, databases, and programming languages. The goal is to foster technical discussions among the Cornell systems research community. We meet once a week, on Fridays at 11:45 AM, in Gates 114 and Bloomberg 201.

The systems lunch is open to all Cornell Ph.D. students interested in systems. First-year graduate students are especially welcome. Non-Ph.D. students have to obtain permission from the instructor. Student participants are expected to sign up for CS 7490, Systems Research Seminar, for one credit.

To join the systems lunch mailing list please send an empty message to cs-systems-lunch-l-request@cornell.edu with the subject line "join". More detailed instructions can be found here.

Links to papers and abstracts below are unlikely to work outside the Cornell CS firewall. If you have trouble viewing them, this is the likely cause.

The Zoom link is https://cornell.zoom.us/j/92777745794?pwd=pLFwl66g0H2zE2nojSEyslSu8nMG0z.1.

The Google calendar link is here.

Consider bringing your own plate/bowl, dinnerware, and reusable water bottle.


Date Paper Presenter
August 30 Organizational meeting
Alex and Hakim
September 6 Cross-Layered OS Design for Managing Memory and Storage Heterogeneity in Modern Datacenters
Abstract: Datacenter systems increasingly rely on heterogeneous architectures to manage data growth, enhance performance, optimize resource utilization, and lower energy costs. In this talk, I will present our vision for operating systems designed to handle extreme memory and storage heterogeneity. I will emphasize the need for a cross-layered OS design philosophy that distributes responsibilities across runtimes, firmware, and hardware controllers. This approach promotes collaborative data processing, improves scalability, performance, and resource efficiency, while preserving the data reliability and isolation guarantees of traditional monolithic OS designs.

To illustrate the potential of this cross-layered approach, I will discuss its application in managing near-storage accelerators and redesigning OSes for storage heterogeneity. I will then explore cross-stacked management of both fast and slow memory technologies. Finally, I will briefly touch upon the opportunities for enhancing current monolithic OSes.
Sudarsun Kannan (Rutgers)
September 13 Verifying Hardware Security Modules with Information-Preserving Refinement
Anish Athalye, M. Frans Kaashoek, and Nickolai Zeldovich (MIT CSAIL)
OSDI 2022
Suraaj Sureshkannan
September 20 Programming Distributed Systems
Abstract: Our interconnected world is increasingly reliant on distributed systems of unprecedented scale, serving applications which must share state across the globe. And, despite decades of research, we're still not sure how to program them! In this talk, I'll show how to use ideas from programming languages to make programming at scale easier, without sacrificing performance, correctness, or expressive power in the process. We'll see how slight tweaks to modern imperative programming languages can provably eliminate common errors due to replica consistency or concurrency, with little to no programmer effort. We'll see how new language designs can unlock new systems designs, yielding both more comprehensible protocols and better performance.
Mae Milano (Princeton)
September 27 Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects
Daniele De Sensi, Lorenzo Pichetti, Flavio Vella, Tiziano De Matteis, Zebin Ren, Luigi Fusco, Matteo Turisini, Daniele Cesarini, Kurt Lust, Animesh Trivedi, Duncan Roweth, Filippo Spiga, Salvatore Di Girolamo, Torsten Hoefler
Supercomputing 2024
Julian Bellavita
October 4 Exploiting Leakage in Password Managers via Injection Attacks
Andrés Fábrega, Armin Namavari, Rachit Agarwal, Ben Nassi, Thomas Ristenpart
USENIX Security 2024
Andrés Fábrega
October 11 Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, Ramachandran Ramjee
OSDI 2024
Abhishek Vijay
October 18 A Decentralized SDN Architecture for the WAN
Alexander Krentsel, Nitika Saran, Bikash Koley, Subhasree Mandal, Ashok Narayanan, Sylvia Ratnasamy, Ali Al-Shabibi, Anees Shaikh, Rob Shakir, Ankit Singla, Hakim Weatherspoon
SIGCOMM 2024
Nitika Saran
October 25 Tiered Memory Management: Access Latency is the Key!
Midhul Vuppalapati and Rachit Agarwal
SOSP 2024
Midhul Vuppalapati
November 1 Enhancing National Security with AI-driven Cybersecurity
Abstract: Imagine a world without software system vulnerabilities. Imagine, too, a world with AI systems we can trust. Dr. Matthew Turek, Deputy Director of the Defense Advanced Research Projects Agency’s (DARPA) Information Innovation Office (I2O), will present the Agency’s approach to achieving both visions. Today’s systems contain hidden vulnerabilities that can be exploited by an adversary. We have built systems so complex that interdependencies make them marginally stable. We’ve seen amazing progress in machine learning, yet the performance appears flashy and not yet trustworthy. Dr. Turek will discuss how DARPA thinks about investing in solutions to these problems, highlight selected programs and results from the office, and demonstrate how we disrupt the state-of-the-art.
Matt Turek (DARPA Information Innovation Office Deputy Director)
November 8 Dynamic Channel Optimization and Jamming Using “Informed” Contextual Multi Armed Bandits
Abstract: Quality-adaptive channel selection improves reliability and performance by dynamically choosing the best channels, reducing interference, and optimizing data transmission. This leads to efficient resource use and consistent communication. We have been applying our novel informed Contextual Multi-Armed Bandit (iCMAB) framework to several channel-related tasks, including dynamic channel access and jamming (offensive and defensive). iCMABs extend conventional Multi-Armed Bandits (MABs) by actively leveraging acquired information and past experiences to enhance decision-making, considering a wider range of contextual factors. An iCMAB exploits contextual features such as signal-to-noise ratio (SNR), channel occupancy, interference patterns, and user mobility to dynamically allocate the communication channels with the highest expected reward. By continuously updating their belief states based on real-time feedback, iCMABs can make near-optimal decisions in selecting channels that minimize interference and maximize throughput. This approach enhances spectrum efficiency and ensures more reliable communication in environments characterized by high variability and uncertainty. iCMABs can also support jamming operations by dynamically adapting jamming strategies based on real-time contextual information, such as the detection of adversarial signals, frequency usage patterns, and environmental conditions, to maximize disruption effectiveness while minimizing detection.
Daniel Krutz (RIT)
November 15 Fast and Safe IO Memory Protection
Benny Rubin, Saksham Agarwal, Qizhe Cai, Rachit Agarwal
SOSP 2024
Benny Rubin
November 22 Autobahn: Seamless High Speed BFT
Neil Giridharan, Florian Suri-Payer, Ittai Abraham, Lorenzo Alvisi, Natacha Crooks
SOSP 2024
Neil Giridharan (Berkeley)
November 29 No Lecture -- Thanksgiving
December 6 Unblocking AI: Understanding and Overcoming Datacenter Network Bottlenecks in Distributed AI Training
Abstract: As companies invest heavily in building new datacenters dedicated to AI training, a critical and often underestimated challenge persists: datacenter networks remain a persistent bottleneck in distributed AI training. Despite advances in computing hardware and machine learning algorithms, network congestion and communication overhead continue to hinder the scalability and efficiency of AI workloads. In this talk, I will share insights from a comprehensive study I led, where a team of researchers and engineers instrumented networks and analyzed traffic patterns across over 20 AI datacenters of a hyperscaler. Our investigation revealed key insights into AI workload characteristics, the root causes of network bottlenecks, and the challenges of resolving them. Building on our findings, I will present novel datacenter designs that address these constraints. These designs challenge traditional network paradigms, such as relying on shortest-path routing or maintaining strict packet ordering, and instead embrace more flexible network strategies. I will demonstrate how these techniques can effectively pinpoint and resolve network bottlenecks, leading to significant performance improvements. I will conclude by discussing open research questions and future directions.
Soudeh Ghorbani (Meta and Johns Hopkins University)
December 13 Yours, Mine, and Ours: Efficient Set Reconciliation in O(n log n) of the SET DIFFERENCE
Abstract: We will discuss and explain a new algorithm that enables efficient reconciliation of sets. Extremely large sets can be reconciled in O(n log n) of the SET DIFFERENCE, not the underlying size of the sets. The algorithm is a variant of erasure codes (familiar to the database community) and fountain codes (familiar to the data communications community). This opens new avenues for solutions based on repairing sets that do not even yet exist! When distributed systems agree in advance what items belong in a set, different participants can add items to the set, effectively performing replica repair over future content. We will explain the set reconciliation algorithm presented at SIGCOMM 2024 in August in the paper "Practical Rateless Set Reconciliation" by Yang et al. and how it achieves such efficiency. This algorithm opens many disparate research opportunities, including replica repair (faster than Merkle trees), improved gossip protocols, scientific computations such as detecting small differences in large genomes, management of cloud-based control planes, and possibly even improvements to the multi-phase protocols used in distributed systems. We hope to conclude with the audience brainstorming about even more possible applications.
Pat Helland and Daniel May (Salesforce)