Leading experts explore advances in computing.

The Department of Computer Science hosts a seminar series featuring leading experts who share insights on cutting-edge developments in computing. The sessions explore innovative approaches to system optimization, computational modeling, and algorithmic problem-solving.

Stay in the loop.
To join the event email list, email cs-systems-lunch-l-request@cornell.edu and put "join" in the subject line. Leave the body of the email blank.
 



Spring 2026 Schedule

Systems Research Seminars take place Fridays in Gates 122 from 11:45 a.m. to 12:45 p.m.
Click here to attend via Zoom.

 

Date: January 23, 2026
Speaker: Haiying Shen, Associate Professor, Computer Science, University of Virginia 
Title: Resource Optimization for ML Inference Serving
Host: Robbert van Renesse
 

Date: January 30, 2026
Speaker: Guy Amir, Postdoctoral Researcher, Cornell University
Title: Deciding Serializability in Network Systems (TACAS 2026)
Host: Rachee Singh


Date: February 6, 2026
Speaker: Jinkun Lin, Postdoctoral Researcher, Cornell University
Title: Understanding Stragglers in Large Model Training Using What-if Analysis
Host: Rachee Singh
 

Date: February 13, 2026
Speaker: Zhiyuan Guo
Title: Virtual Decoupled Cores: A Composable Runtime for Modern Async GPUs
Host: Adrian Sampson
 

Date: February 20, 2026
Speaker: Xilin Tang
Title: Scalable Far Memory: Balancing Faults and Evictions
Host: Alex Conway
 

Date: February 27, 2026
Speaker: TBD
Title: TBD
Host: TBD
 

Date: March 6, 2026
Speaker: TBD
Title: TBD
Host: TBD
 

Date: March 13, 2026
Speaker: Yunxi Shen
Title: TBD
Host: Hakim Weatherspoon
 

Date: March 20, 2026
Speaker: Abhishek Kumar
Title: TBD
Host: Rachee Singh
 

Date: March 27, 2026
Speaker: Shuangyu Lei
Title: TBD
Host: Hakim Weatherspoon
 

Date: April 10, 2026
Speaker: Jamal Hashim
Title: TBD
Host: Rachee Singh
 

Date: April 17, 2026
Speaker: Austin Li
Title: TBD
Host: Lorenzo Alvisi
 

Date: April 24, 2026
Speaker: Yuqi Mai
Title: TBD
Host: Alex Conway
 

Date: May 1, 2026
Speaker: Bhaskar Kataria
Title: TBD
Host: Rachee Singh
 

Date: May 8, 2026
Speaker: Omar Eqbal
Title: TBD
Host: Rachit Agarwal

Past Events

Browse past lectures.

Date: September 12, 2025
Speaker: Harold Triedman
Title: Multi-Agent Systems Execute Arbitrary Malicious Code
 

Date: September 19, 2025
Speaker: Daniel Lee
Title: Mako: Speculative Distributed Transactions with Geo-Replication
Host: Lorenzo Alvisi

Date: September 26, 2025
Speaker: Karuna Grewal
Title: ServiceRouter: Hyperscale and Minimal Cost Service Mesh at Meta
Host: Justin Hsu

Date: October 3, 2025
Speaker: Charly Castes
Title: Securing Systems Foundations: The Design and Verification of a Virtual Firmware Monitor
Host: Andrew Myers

Date: October 10, 2025
Speaker: Simon Bertron
Title: Chardonnay: Fast and General Datacenter Transactions for On-Disk Databases
Host: Alex Conway

Date: October 17, 2025
Speaker: Salman Abid
Title: PAPAYA Federated Analytics Stack: Engineering Privacy, Scalability and Practicality
Host: Hakim Weatherspoon

Date: October 31, 2025
Speaker: Jiaxin Lin
Title: Building Next-Generation Accelerated Data Center Networking Systems
Host: Rachee Singh
Click here to view the recording

Date: November 7, 2025
Speaker: Alicia Yang
Title: PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Host: Ken Birman

Date: November 14, 2025
Speaker: Maria Apostolaki, Princeton
Title: Logic-Guided Machine Learning for Autonomous Network Management
Host: Rachee Singh

Date: November 21, 2025
Speaker: Jiayi (Jane) Chen, UT Austin
Title: ML for Networks in the Real World: Challenges Beyond Training the Models
Host: Alex Conway


Date: December 5, 2025
Speaker: Keting Chen
Title: MixNet: A Runtime Reconfigurable Optical-Electrical Fabric for Distributed Mixture-of-Experts Training
Host: Hakim Weatherspoon

01.31.25: Towards Swap-Free, Continuous Ballooning for Fast, Cloud-Based Virtual Machine Migrations
Speaker: Kevin Negy, Cornell
Host: Hakim Weatherspoon
Abstract: Ballooning is a technique for modifying the memory size of a virtual machine and has been used to speed up VM migration and increase VM consolidation. However, it carries a significant risk: the ominous out-of-memory (OOM) error. The issue is that it is infeasible to use ballooning during high-risk scenarios, namely giant memory spikes and live migration, for fear of swapping or, worse, OOM errors.
We advance the state of the art by optimizing the Linux balloon driver for VM migration in a non-overcommitted context, enabling it to handle both high-risk scenarios without relying on swapping and without causing OOM errors. We add a user-space continuous-ballooning program that, in tandem with our balloon driver modifications, can handle memory spikes of hundreds of gigabytes as well as survive an indefinite number of migrations.
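To illustrate the general flavor of a continuous-ballooning control loop (a toy sketch of the concept, not the authors' driver modifications), one can picture a user-space process that periodically reads guest memory pressure and nudges a balloon target up or down in small steps; the balloon interface below is a hypothetical stand-in.

    import time

    # Toy continuous-ballooning control loop (illustrative only). Free memory is read
    # from /proc/meminfo on Linux; set_balloon_target() is a hypothetical placeholder
    # for whatever interface the balloon driver actually exposes.

    LOW_MB = 2048    # deflate (give pages back to the guest) below this much free memory
    HIGH_MB = 8192   # inflate (reclaim pages for the host) above this much free memory
    STEP_MB = 512    # adjust in small steps to reduce the risk of swapping or OOM

    def free_mem_mb() -> int:
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1]) // 1024
        return 0

    def set_balloon_target(target_mb: int) -> None:
        # Hypothetical: a real implementation would talk to the balloon driver here.
        print(f"balloon target -> {target_mb} MB")

    def control_loop(target_mb: int, interval_s: float = 1.0) -> None:
        while True:
            free = free_mem_mb()
            if free < LOW_MB:
                target_mb = max(0, target_mb - STEP_MB)   # memory spike: shrink the balloon
            elif free > HIGH_MB:
                target_mb += STEP_MB                      # idle memory: grow the balloon
            set_balloon_target(target_mb)
            time.sleep(interval_s)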

 

 

02.07.25: Towards an Algebraic Theory of Systems
Speaker: Suraaj Sureshkannan, Cornell
Host: Andrew Myers
Abstract: Computer systems are built by composing smaller components. This work identifies the key requirements for system composition and introduces the concept of a 'system algebra', which formalizes the various composition operations one can perform on systems. In this setting, the idea of composition-order invariance is formalized: it asserts that the order in which systems are composed or connected does not affect the final result. The framework is applicable to a variety of systems, including physical systems, electronic circuits, and distributed networks.
In addition, a hierarchy of useful "models" of this system algebra is presented. Of particular interest to Computer Science are 'functional system algebras', of which Kahn Networks and Causal Networks are prominent examples.
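Informally (this is our reading, not the speaker's formal definition), composition-order invariance can be thought of as algebraic laws for a composition operator ∘ on systems, for example

    (A ∘ B) ∘ C = A ∘ (B ∘ C)        A ∘ B = B ∘ A

so that composing the same collection of components in any order yields the same overall system.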

 

 

02.14.25: DiskANN: Fast, Accurate, Billion-point Nearest Neighbor Search on a Single Node
Speaker: Ben Landrum, Cornell
Host: Alex Conway
Abstract: Current state-of-the-art approximate nearest neighbor search (ANNS) algorithms generate indices that must be stored in main memory for fast high-recall search. This makes them expensive and limits the size of the dataset. We present a new graph-based indexing and search system called DiskANN that can index, store, and search a billion-point database on a single workstation with just 64GB RAM and an inexpensive solid-state drive (SSD). Contrary to current wisdom, we demonstrate that the SSD-based indices built by DiskANN can meet all three desiderata for large-scale ANNS: high recall, low query latency, and high density (points indexed per node). On the billion-point SIFT1B bigann dataset, DiskANN serves > 5000 queries a second with < 3ms mean latency and 95%+ 1-recall@1 on a 16-core machine, where state-of-the-art billion-point ANNS algorithms with similar memory footprint like FAISS and IVFOADC+G+P plateau at around 50% 1-recall@1. Alternatively, in the high-recall regime, DiskANN can index and serve 5-10x more points per node compared to state-of-the-art graph-based methods such as HNSW and NSG. Finally, as part of our overall DiskANN system, we introduce Vamana, a new graph-based ANNS index that is more versatile than existing graph indices, even for purely in-memory use.
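For readers unfamiliar with the metric, 1-recall@1 is simply the fraction of queries for which the single candidate returned by the index is the true nearest neighbor. A minimal NumPy sketch (illustrative, not part of DiskANN):

    import numpy as np

    def recall_at_1(returned_ids: np.ndarray, true_nn_ids: np.ndarray) -> float:
        """1-recall@1: fraction of queries whose top result is the exact nearest neighbor.

        returned_ids: (num_queries,) id returned by the ANNS index for each query
        true_nn_ids:  (num_queries,) ground-truth nearest-neighbor id for each query
        """
        return float(np.mean(returned_ids == true_nn_ids))

    # Example: if 95 of 100 queries return the exact nearest neighbor, recall is 0.95.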

 

02.21.25: A survey of methods for synthesizing collective communication algorithms
Speaker: Shouxu Lin, Cornell
Host: Rachit Agarwal
Abstract: Exploiting parallelism to train machine learning models requires GPUs to collaborate effectively through collective communication to transfer data between them, which becomes a significant bottleneck in training large models. Thus, it is important to design efficient collective communication algorithms to reduce the overhead in terms of end-to-end latency. However, designing optimal algorithms is challenging because it depends on communication patterns and the underlying physical topology, and requires exploring a large, multi-dimensional design space of virtual topologies, mappings of virtual to physical topologies, and complicated schedules. The community has been exploring various approaches to synthesizing collective communication algorithms. This talk examines key design considerations, evaluates existing synthesis approaches, discusses their advantages and limitations, and outlines unresolved challenges for future research.
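As a concrete point of reference (ours, not from the talk), one classical hand-designed collective is the ring all-reduce, which performs a reduce-scatter followed by an all-gather around a logical ring and sends roughly 2(p-1)/p of the data per GPU. The simulation below sketches that structure in plain Python/NumPy; real systems would issue these transfers through a library such as NCCL or MPI.

    import numpy as np

    def ring_allreduce(chunks_per_rank):
        """Simulate a ring all-reduce: reduce-scatter, then all-gather.

        chunks_per_rank: list of length p; entry r is rank r's local data split
        into p chunks. Returns the data with every rank holding the fully
        reduced (summed) value of every chunk. Illustrative simulation only.
        """
        p = len(chunks_per_rank)
        data = [list(c) for c in chunks_per_rank]

        # Reduce-scatter: in step s, rank r sends chunk (r - s) mod p to rank r + 1,
        # which adds it into its own copy. After p - 1 steps, rank r holds the full
        # sum of chunk (r + 1) mod p.
        for s in range(p - 1):
            outgoing = [(r, (r - s) % p, data[r][(r - s) % p]) for r in range(p)]
            for r, c, payload in outgoing:
                data[(r + 1) % p][c] = data[(r + 1) % p][c] + payload

        # All-gather: circulate the fully reduced chunks around the ring.
        for s in range(p - 1):
            outgoing = [(r, (r + 1 - s) % p, data[r][(r + 1 - s) % p]) for r in range(p)]
            for r, c, payload in outgoing:
                data[(r + 1) % p][c] = payload

        return data

    # Example: 4 ranks, each chunk on rank r initialized to [r]; 0+1+2+3 = 6 everywhere.
    p = 4
    chunks = [[np.array([float(r)]) for _ in range(p)] for r in range(p)]
    assert all(np.allclose(chunk, 6.0) for rank in ring_allreduce(chunks) for chunk in rank)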

 

02.28.25: Cross-stack Design for Sustainable AI Infrastructure
Speaker: Yueying (Lisa) Li, Cornell
Host: Hakim Weatherspoon
Abstract: The rapid increase in LLM ubiquity and scale places unprecedented demands on computing infrastructure. These demands not only consume large compute and memory resources but also significant energy, yielding large operational and embodied carbon emissions. In this work, we present three main observations based on modeling and traces from the production deployment of two Generative AI services at a major cloud service provider. First, while GPUs dominate operational carbon, host processing systems (e.g., CPUs, memory, storage) dominate embodied carbon. Second, offline, batch inference accounts for a significant portion (up to 55%) of serving capacity. Third, there are different levels of heterogeneity across hardware and workloads for LLM inference. Based on these observations, we design EcoServe, a carbon-aware resource provisioning and scheduling framework for LLM serving systems.
EcoServe is based on four principles: Reduce, Reuse, Rightsize, and Recycle (4R). With a cross-stack ILP formulation and design, we demonstrate that EcoServe can lower carbon emissions by up to 47% compared to performance-, energy-, and cost-optimized design points, while maintaining performance targets and SLOs.

 

03.07.25: Strategies for Training Massive AI Workloads
Speaker: Tanmaey Gupta, Cornell University
Host: Christopher De Sa
Abstract: The rapid advancement of deep learning for generative tasks has shown strong scaling laws, where model performance increases in proportion to model size. This has led to the proliferation of machine learning models with billions and trillions of parameters. Training such large-scale models presents significant challenges in memory efficiency, compute utilization, and communication overhead. Addressing these challenges requires non-trivial strategies for parallelizing and synchronizing models at scale. This talk explores the landscape of training performant models at large scale and discusses techniques such as 5D parallelism, DeepSpeed, and FSDP. We discuss the trade-offs between these methods in terms of memory efficiency, communication overhead, and compute intensity, offering insights into their optimizations. Finally, we cover best practices and practical implementation insights for training large models.
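A rough back-of-envelope calculation (standard mixed-precision/Adam accounting, not figures from the talk) shows why such strategies are unavoidable: with fp16 weights and gradients plus fp32 master weights and Adam moments, each parameter costs roughly 16 bytes of training state before counting activations.

    def training_state_gb(num_params: float, bytes_per_param: int = 16) -> float:
        """Approximate memory for weights + gradients + optimizer state in
        mixed-precision Adam training (~16 bytes/parameter, the usual ZeRO-style
        estimate); activation memory is excluded."""
        return num_params * bytes_per_param / 1e9

    # A 70B-parameter model needs about 70e9 * 16 / 1e9 = 1120 GB of state, i.e.
    # roughly fourteen 80 GB GPUs before any activations, which is why sharding
    # (FSDP/ZeRO) and multi-dimensional parallelism are required at this scale.
    print(training_state_gb(70e9))   # ~1120.0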

 

03.14.25: Snowflake, a censorship circumvention system using temporary WebRTC proxies
Speaker: James Austgen, Cornell
Host: Hakim Weatherspoon
Abstract: Snowflake is a system for circumventing Internet censorship. Its blocking resistance comes from the use of numerous, ultra-light, temporary proxies ("snowflakes"), which accept traffic from censored clients using peer-to-peer WebRTC protocols and forward it to a centralized bridge. The temporary proxies are simple enough to be implemented in JavaScript, in a web page or browser extension, making them much cheaper to run than a traditional proxy or VPN server. The large and changing pool of proxy addresses resists enumeration and blocking by a censor. The system is designed with the assumption that proxies may appear or disappear at any time. Clients discover proxies dynamically using a secure rendezvous protocol. When an in-use proxy goes offline, its client switches to another on the fly, invisibly to upper network layers.

 

03.21.25: AQUA: Network-Accelerated Memory Offloading for LLMs in Scale-Up GPU Domains
Speaker: Abhishek Vijaya Kumar, Cornell
Host: Rachee Singh
Abstract: Inference on large-language models (LLMs) is constrained by GPU memory capacity. A sudden increase in the number of inference requests to a cloud-hosted LLM can deplete GPU memory, leading to contention between multiple prompts for limited resources. Modern LLM serving engines deal with the challenge of limited GPU memory using admission control, which causes them to be unresponsive during request bursts. We propose that preemptive scheduling of prompts in time slices is essential for ensuring responsive LLM inference, especially under conditions of high load and limited GPU memory. However, preempting prompt inference incurs a high paging overhead, which reduces inference throughput. We present Aqua, a GPU memory management framework that significantly reduces the overhead of paging inference state, achieving both responsive and high-throughput inference even under bursty request patterns. We evaluate Aqua by hosting several state-of-the-art large generative ML models of different modalities on servers with 8 Nvidia H100 80G GPUs. Aqua improves the responsiveness of LLM inference by 20X compared to the state-of-the-art and improves LLM inference throughput over a single long prompt by 4X.
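To make the time-slicing idea concrete (a toy round-robin sketch, not Aqua's actual scheduler), preemption can be pictured as rotating active prompts through the GPU in fixed quanta; a real system must additionally page each preempted prompt's inference state (e.g., its KV cache) off the GPU, which is exactly the overhead Aqua targets.

    from collections import deque

    def time_sliced_schedule(prompts, quantum_steps, run_slice):
        """Toy round-robin scheduler: run each prompt for up to quantum_steps decode
        steps, then preempt it and move on (illustrative only).

        prompts: dict mapping prompt_id -> remaining decode steps
        run_slice: callback(prompt_id, steps) invoked for each scheduled slice
        """
        ready = deque(prompts)
        while ready:
            pid = ready.popleft()
            steps = min(quantum_steps, prompts[pid])
            run_slice(pid, steps)            # execute this prompt's slice on the GPU
            prompts[pid] -= steps
            if prompts[pid] > 0:             # unfinished: preempt and requeue
                ready.append(pid)

    # Example: three prompts with 10, 3, and 6 remaining steps, 4-step quanta.
    time_sliced_schedule({"p0": 10, "p1": 3, "p2": 6}, 4,
                         lambda pid, n: print(f"run {pid} for {n} steps"))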

 

03.28.25: uMMU: Securing Data Confidentiality with Unobservable Memory Subsystem
Speaker: Samuel Breckenridge, Cornell
Host: Hakim Weatherspoon
Abstract: Ensuring data confidentiality in a computing system's memory hierarchy has proved to be a formidable challenge, given the large attack surface. Diverse and powerful attacks threaten data confidentiality. Memory safety is notoriously hard to achieve with unsafe languages, empowering adversaries with unauthorized memory accesses, as exemplified by the Heartbleed incident. More recently, microarchitectural side-channel attacks have become a prevalent threat to data confidentiality, affecting program execution even for programs safeguarded inside TEEs.
In this paper, we introduce an in-process memory subsystem called uMMU. uMMU coherently consolidates the notion of employing processor registers as unobservable storage with data confidentiality protection techniques such as memory encryption and Oblivious RAM. uMMU creates a new address space called uVirtual address space that is unobservable to adversaries. Under the abstraction created by uMMU, the processor's spacious extended registers, such as Intel x86's AVX512, are transformed into unobservable and addressable physical memory backing. 

 

04.18.25: NCC: Natural Concurrency Control for Strictly Serializable Datastores by Avoiding the Timestamp-Inversion Pitfall
Speaker: Austin Li, Cornell
Host: Lorenzo Alvisi
Abstract: Strictly serializable datastores greatly simplify application development. However, existing techniques pay unnecessary costs for naturally consistent transactions, which arrive at servers in an order that is already strictly serializable. We exploit this natural arrival order by executing transactions with minimal costs while optimistically assuming they are naturally consistent, and then leverage a timestamp-based technique to efficiently verify if the execution is indeed consistent. In the process of this design, we identify a fundamental pitfall in relying on timestamps to provide strict serializability and name it the timestamp-inversion pitfall. We show that timestamp inversion has affected several existing systems.

 

04.25.25: Datacentric Multi-Acceleration at Scale
Speaker: Mohammad Alian, Cornell
Host: Adrian Sampson
Abstract: So far, we have relied on technology scaling, system scale-out, and specialization within the single domain of deep neural networks to power planet-scale applications. With the rise of generative AI and compound AI systems, end-to-end AI-powered applications increasingly span multiple domains, and GPU-accelerated systems will no longer be sufficient to meet the compute demands of next-generation, planet-scale GenAI applications.
To unleash the next wave of compute, we must move toward multi-acceleration. However, multi-acceleration will quickly become limited by the data delivery between accelerators. In this talk, I present the vision of data-centric multi-acceleration and how we aim to realize it by focusing on memory and data delivery specialization. I conclude by briefly introducing two of our recent works accepted to ASPLOS 2025 and ISCA 2025 that use specialized, compute-enabled memories to accelerate retrieval-augmented generation.

 

05.02.25: Birds of a Feather Flock Together: Scaling RDMA RPCs with FLOCK
Speaker: Muhammad Ahmed, Cornell
Host: Hakim Weatherspoon
Abstract: RDMA-capable networks are gaining traction in datacenter deployments due to their high throughput, low latency, CPU efficiency, and advanced features such as remote memory operations. However, efficiently utilizing RDMA capability in the common setting of a high fan-in, fan-out asymmetric network topology is challenging. For instance, using RDMA programming features comes at the cost of connection scalability: the number of connections does not scale with increasing cluster size. To address this, several works forgo some RDMA features and focus only on conventional RPC APIs.
In this work, we strive to exploit the full capability of RDMA while scaling the number of connections regardless of cluster size. We present Flock, a communication framework for RDMA networks that uses hardware-provided reliable connections. Using a partially shared model, Flock departs from conventional RDMA design by enabling connection sharing among threads, which provides significant performance improvements, contrary to the widely held belief that connection sharing deteriorates performance. At its core, Flock uses a connection handle abstraction for connection multiplexing; a new coalescing-based synchronization approach for efficient network utilization; and a load-control mechanism for connections with symbiotic send-recv scheduling, which reduces the synchronization overheads associated with connection sharing while ensuring fair utilization of network connections. We demonstrate Flock's benefits on a distributed transaction processing system and an in-memory index, where it outperforms other RPC systems by up to 88% and 50%, respectively, with significant reductions in median and tail latency.
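A quick back-of-envelope estimate (ours, not from the talk) illustrates the connection-scalability problem: if every thread opens a reliable connection to every thread on every other machine, the per-machine queue-pair count grows with both cluster size and thread count, far beyond what a NIC can cache on-chip.

    def queue_pairs_per_machine(num_machines: int, threads_per_machine: int) -> int:
        """Reliable-connection queue pairs one machine must keep open under an
        all-to-all, thread-to-thread connection model (illustrative estimate)."""
        return threads_per_machine * threads_per_machine * (num_machines - 1)

    # 100 machines x 32 threads each: 32 * 32 * 99 = 101,376 QPs per machine,
    # which is the pressure that connection sharing across threads (as in Flock)
    # is designed to relieve.
    print(queue_pairs_per_machine(100, 32))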

 

05.09.25: Torch2Chip: An End-to-end Customizable AI Model Compression and Deployment Toolkit for Prototype Hardware Accelerator Design
Speaker: Jae-sun Seo, Cornell Tech
Host: Adrian Sampson
Abstract: AI model compression techniques like quantization and pruning have been widely explored for vision and language tasks, driven by the growth of AI hardware accelerators such as ASICs and FPGAs. While these methods aim to accelerate computations on low-power devices, current hardware-algorithm co-design faces challenges. Modern frameworks (e.g., PyTorch) only support fixed 8-bit precision, limiting flexibility. Many quantization methods also produce discretized floating-point values rather than low-precision integers, complicating hardware deployment. Existing compression toolkits remain limited to proprietary solutions, restricting customization for prototype hardware designs. To address this, we introduce Torch2Chip, an open-source, customizable, high-performance toolkit that supports user-defined compression algorithms, automatic model fusion, and parameter extraction. Torch2Chip delivers deployment-ready formats for a range of AI models and supports both supervised and advanced lightweight self-supervised learning.