CS 6410: Advanced Systems

CS 6410 is an advanced course in computer systems targeted to students interested in systems as a graduate research area. Topics may include systems specification, systems measurement, communication protocols, consistency in distributed systems, fault tolerance, knowledge and knowledge-based protocols, performance, scheduling, concurrency control, and security issues. As a Cornell CS PhD student, you are required to demonstrate basic competence in systems at the CS 4410 level, take a systems breadth course, and take at least one 6xxx course in the “systems style”. CS6410 can satisfy all of these requirements.
Prerequisites: CS 6410 is open to any CS PhD or MS student and, with permission of the instructor, to students who have mastered the material in CS 3410 or 3420 (ECE 3140) and CS 4410.

Inclusion

  • We strive to make CS6410 a welcoming, safe, equitable, and respectful environment, consistent with Cornell’s commitments
  • We recognize that the society we live in is none of those things, that we have implicit biases, and that we have to work hard every day to counter those biases to create an inclusive environment
  • If you witness a bias incident or have been the victim of one, please file a confidential report with Cornell
  • If you have any suggestions such as improvements to the web site, syllabi, slides, homework, and so on, you can email cs6410-prof@cornell.edu

Lecture

Lectures take place on Tuesdays and Thursdays, 8:40am - 9:55am in 114 Gates Hall and in Bloomberg Center 497 over video. Attendance and active participation at each lecture are expected.

Communications

  • We will use CMSX for submissions and grades
  • We will use Ed Discussion for questions and answers
  • For time sensitive matters, please email cs6410-staff
  • For sensitive matters, please email cs6410-prof
  • Please do not contact course staff or instructors about this course via their personal email addresses, Facebook, text messages, etc.

Homework

  • Academic integrity:
    • Homework may be discussed only within your study group, with no outside help other than from the teaching staff
    • Do not look at code that is not by your study group
    • Do not share your study group's code with anybody
    • OK to discuss concepts with students in other groups
    • Violations will be prosecuted
  • There is no exam

Fall 2025 Course Schedule

Some papers require on-campus access or a VPN. Alternatively, change “dl.acm.org/” to “dl-acm-org.proxy.library.cornell.edu/” in the URL.

Week Class # Day Date Topic Reading Notes
1 1 Tue Aug 26 Administrivia How to read a paper
2 Thu Aug 28 Classical systems; intro to first programming project Epidemic algorithms for replicated database maintenance by Alan Demers et al. Due Aug 28, 7.30am
2 3 Tue Sep 2 Building Large, Principled Systems End-to-end arguments in system design by Jerry H. Saltzer et al.

Hints for computer system design by Butler W. Lampson
Due Sep 2, 7.30am
4 Thu Sep 4 Classical Systems The UNIX time-sharing system by Dennis M. Ritchie et al.

The structure of the “THE”-multiprogramming system by Edsger W. Dijkstra
Due Sep 4, 7.30am
3 5 Tue Sep 9 Classic File Systems Required: The design and implementation of a log-structured file system by Mendel Rosenblum et al.

Optional: A fast file system for UNIX by Marshall K. McKusick et al.
Due Sep 9, 7.30am
6 Thu Sep 11 Concurrency, Threads, and Events On the duality of operating system structures by Hugh C. Lauer et al.

Capriccio: scalable threads for internet services by Rob von Behren et al.
Due Sep 11, 7.30am
4 7 Tue Sep 16 µ-kernels Required: The performance of μ-kernel-based systems by Hermann Härtig et al.

Optional: Mach: a foundation for open systems (operating systems) by Richard Rashid et al.
Due Sep 16, 7.30am
8 Thu Sep 18 Extensible Kernels (Lindsey) Required: Exokernel: an operating system architecture by Dawson R. Engler et al.

Optional: Extensibility, Safety and Performance in the SPIN Operating System by Brian N. Bershad et al.
Due Sep 18, 7.30am
5 9 Tue Sep 23 Virtualization (Irene) Required: Xen and the art of virtualization by Paul Barham et al.

Optional: The Origin of the VM/370 Time-Sharing System by R. J. Creasy
Due Sep 23, 7.30am
10 Thu Sep 25 Many Cores (Yifan) Required: An Analysis of Linux Scalability to Many Cores by Silas Boyd-Wickizer et al.

Optional: Corey: An Operating System for Many Cores by Silas Boyd-Wickizer et al.
Due Sep 25, 7.30am
6 11 Tue Sep 30 seL4 (Ernest) Required: seL4: formal verification of an OS kernel by Gerwin Klein et al.

Optional: Shielding Applications from an Untrusted Cloud with Haven by Andrew Baumann et al.
Due Sep 30, 7.30am
12 Thu Oct 2 Gossip Program Evaluation Session
7 13 Tue Oct 7 Clocks / NTP (Ben) Required: Time, clocks, and the ordering of events in a distributed system by Leslie Lamport

Optional: Internet time synchronization: the network time protocol by David L. Mills
Due Oct 7, 7.30am
14 Thu Oct 9 Chain replication / Fault tolerance (Jacqueline) Implementing fault-tolerant services using the state machine approach: a tutorial by Fred B. Schneider

Chain replication for supporting high throughput and availability by Robbert van Renesse et al.
Due Oct 9, 7.30am
8 - Tue Oct 14 Fall Break
15 Thu Oct 16 Distributed Systems: FLP (Gilad) Impossibility of distributed consensus with one faulty process by Michael J. Fischer et al. Due Oct 16, 7.30am
9 16 Tue Oct 21 Paxos (Jorge) Required: Paxos Made Simple by Leslie Lamport

Optional: In search of an understandable consensus algorithm by Diego Ongaro et al.
Due Oct 21, 7.30am
17 Thu Oct 23 Byzantine Fault Tolerance (Daniel L.) Required: The Byzantine Generals Problem by Lamport et al.

Optional: Practical Byzantine fault tolerance by Miguel Castro et al.
Due Oct 23, 7.30am
10 18 Tue Oct 28 Chord (Chengyu) Required: Chord: A scalable peer-to-peer lookup service for internet applications by Ion Stoica et al.

Optional: The impact of DHT routing geometry on resilience and proximity by Krishna Gummadi et al.
Due Oct 28, 7.30am
19 Thu Oct 30 Dynamo (Muhammad) Dynamo: amazon's highly available key-value store by Giuseppe DeCandia et al. Due Oct 30, 7.30am
11 20 Tue Nov 4 Google File System (Jamal) Required: The Google file system by Sanjay Ghemawat et al.

Optional: Spanner: Google’s Globally Distributed Database by James C. Corbett et al.
Due Nov 4, 7.30am
21 Thu Nov 6 MapReduce / Spark (Julian) MapReduce: simplified data processing on large clusters by Jeffrey Dean et al.

Apache Spark: a unified engine for big data processing by Matei Zaharia et al.
Due Nov 6, 7.30am
12 22 Tue Nov 11 Serverless / OpenLambda (Munachimso) Occupy the cloud: distributed computing for the 99% by Eric Jonas et al.

Serverless computation with openLambda by Scott Hendrickson et al.
Due Nov 11, 7.30am
23 Thu Nov 13 Large Scale Data Analytics (Daniel E.) Required: Pregel: a system for large-scale graph processing by Grzegorz Malewicz et al.

Optional: TensorFlow: a system for large-scale machine learning by Martin Abadi et al.
Due Nov 13, 7.30am
13 24 Tue Nov 18 Congestion control (Anshuman) Congestion avoidance and control by Van Jacobson Due Nov 18, 7.30am
25 Thu Nov 20 Software-Defined and Programmable Networks (Sheng-Yen) Required: Arrakis: The Operating System Is the Control Plane by Simon Peter et al.

Optional: OpenFlow: enabling innovation in campus networks by Nick McKeown et al.
Due Nov 20, 7.30am
14 26 Tue Nov 25 Bitcoin Bitcoin: A Peer-to-Peer Electronic Cash System by Satoshi Nakamoto Due Nov 25, 7.30am
- Thu Nov 27 Thanksgiving Break
15 27 Tue Dec 2 Final student project presentations TBD Due Dec 2, 7.30am
28 Thu Dec 4 Final student project presentations TBD Due Dec 4, 7.30am

Resources

Course Management

  • Assignments will be collected via CMS

List of example ‘seminal papers’

  • Epidemic algorithms for replicated database maintenance
  • Paxos made simple
  • Bitcoin: A Peer-to-Peer Electronic Cash System
  • End-to-end arguments in system design
  • The Design and Implementation of a Log-Structured File System
  • On the duality of operating system structures
  • Exokernel: an operating system architecture for application-level resource management
  • Xen and the Art of Virtualization
  • Running Commodity Operating Systems on Scalable Multiprocessors
  • The Google file system
  • Spanner: Google’s Globally Distributed Database
  • MapReduce: Simplified Data Processing on Large Clusters
  • Time, Clocks, and the Ordering of Events in a Distributed System
  • Distributed snapshots: determining global states of distributed systems
  • Implementing fault-tolerant services using the state machine approach: A tutorial
  • Chain replication for high throughput and availability
  • Impossibility of Distributed Consensus with One Faulty Process
  • Chord: A scalable peer-to-peer lookup service for internet applications
  • Dynamo: Amazon’s Highly Available Key-Value Store
  • Congestion Avoidance and Control
  • seL4: formal verification of an OS kernel
  • An Analysis of Linux Scalability to Many Cores
  • The Multikernel: A new OS architecture for scalable multicore systems
  • Capriccio: Scalable Threads for Internet Services
  • The benefits and costs of writing a POSIX kernel in a high-level language
  • Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
  • Apache Spark: A Unified Engine for Big Data Processing
  • Pregel: A System for Large-Scale Graph Processing
  • TensorFlow: A System for Large-Scale Machine Learning
  • Serverless Computation with OpenLambda
  • Occupy the Cloud: Distributed Computing for the 99%
  • Arrakis: The Operating System is the Control Plane
  • OpenFlow: Enabling Innovation in Campus Networks
  • TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones
  • Internet Time Synchronization: The Network Time Protocol

Textbooks

The course is mostly based on published papers.

Suggested Additional readings:

Office Hours

We look forward to seeing you in office hours! Check out the availability of the course staff below:

Day             Time            Where               Who
By appointment  By appointment  Gates 427 or Zoom   Hakim Weatherspoon
By appointment  By appointment  Gates 440 or Zoom   Salman Abid

Meet the Course Staff

Instructor

Hakim Weatherspoon
(he/him)
Professor
Hometown
Ithaca, NY
Hakim Weatherspoon received his PhD in 2006 from the University of California, Berkeley, in the area of secure and fault-tolerant distributed wide-area storage systems (e.g. Antiquity, OceanStore, etc.).

Graduate TA(s)

Salman Abid
(he/him)
CS PhD Student
Hometown
Karachi, Pakistan

Projects

Projects can only be accessed from within the Cornell University network or via VPN.

Hierarchical and Programmable Packet Scheduling with PIEO Trees

Team Members

Anshuman Mohan

Abstract

This course project explores the design of PIEO trees, a new data structure for implementing hierarchical packet scheduling policies that mix work-conserving and non-work-conserving policies. PIEO trees are more expressive than PIFO trees, another data structure that implements hierarchical work-conserving scheduling policies.

PIEO trees have a cleaner semantics than a prior attempt to use multiple PIEOs for hierarchical scheduling. PIEO trees may suffer a performance or area cost compared to those two alternatives, and this project will eventually quantify that cost.


Project Files


Partitioning for Memory Efficient Search Graphs

Project Diagram

A diagram illustrating how the proposed construction reduces the number of neighbors stored by a search graph by merging and deduplicating the neighborhoods of points in a partition.

Team Members

Ben Landrum

Abstract

Search graphs represent the state of the art in high-recall approximate nearest neighbor search over dense vectors, an important primitive for information retrieval and RAG-based applications. Their major weaknesses come from large index sizes and poor data locality at query time. We propose a technique that compresses the graph with partitioning to reduce the memory footprint and leverage fast linear reads at query time.


Project Files


Efficient vLLM Inference with Lazy KV Block Eviction

Project Image


Team Members

Chengyu Huang

Abstract

The use of decoder-only transformers makes Large Language Models powerful but also renders them inefficient. In particular, computing the attention matrix incurs O(N²) memory and computational costs. vLLM was proposed to mitigate this issue through efficient KV caching. Inspired by logical memory pages, it segments the KV cache of a text sequence into logical KV blocks, which can be mapped to physical blocks in GPU memory with minimal overhead. However, when memory is full, vLLM uses a naive all-or-nothing mechanism that evicts the KV cache of an entire existing sequence to make space for new incoming caches from a high-priority request. This project explores a more efficient eviction mechanism that lazily evicts the minimal number of existing KV blocks when new space is needed. We propose two eviction strategies: (1) evict only a small fixed number of KV blocks; (2) use a conservative estimate of the future completion length of the high-priority request and calculate the number of blocks to evict. We run simulation experiments on CPU and compare our method against native vLLM. We observe higher throughput (tokens/s) and lower eviction cost than native vLLM, which shows the potential of our methods.


Project Files


Project Title

Project Image

Team Members

Daniel Enriquez

Abstract

Augmented Reality (AR) glasses present unique opportunities for researchers creating geo-located and context-aware applications, yet they pose challenges for researchers aiming to understand user interactions and decision-making. Due to the lightweight, power-constrained nature of optical see-through AR devices, computational resources are primarily devoted to localization and application memory, leaving limited capacity for data collection. This limitation hinders the ability of interaction researchers to capture high-fidelity logs of application information. While existing literature has explored lightweight datastreams in wearable biosensing devices, there remains a significant gap in accessible, generalizable data-logging frameworks for AR glasses. To address this gap, this work proposes the development of a lightweight data-logging system tailored for AR glasses, designed to efficiently record and synchronize relevant interaction data streams without compromising device performance. This system aims to support the broader research community by enabling robust, scalable, and replicable data collection in AR glasses-based user studies.


Project Files


Sintr: Safe Interactive Transactions in the presence of byzantine clients

Project Image

Team Members

Daniel Lee, Austin Li, Florian Suri-Payer, Natacha Crooks, Lorenzo Alvisi

Abstract

This paper presents Sintr, a framework for client-driven transactions executed in a Byzantine environment. Interactive (client-driven) transactions are often preferred by developers due to their flexibility and ease of use. However, ensuring that application semantics for interactive transactions are upheld when clients can be Byzantine is a non-trivial problem. Sintr aims to solve this by having multiple clients perform redundant execution of transactions. Once a transaction acquires the necessary number of matching results specified by some data policy, it is allowed to commit to the underlying database. Sintr aims to add minimal overhead to existing BFT databases while guaranteeing correct application semantics in the presence of Byzantine clients.


Project Files


Protocols: A Language for Testing and Debugging Hardware Modules

Team Members

Ernest Ng

Abstract

Composing hardware modules is notoriously error-prone: different components have varying timing requirements for control and data signals. However, existing testing tools conflate the communication protocol of hardware modules with the data being communicated. Moreover, current waveform debugging tools only display low-level signal values at the level of clock cycles, making it difficult for engineers to localize higher-level, transaction-level bugs. We propose a domain-specific language (DSL), Protocols, in which users can write executable specifications of hardware modules’ communication behavior. Our DSL comes equipped with a monitor, which infers a transaction-level trace from waveform data based on the Protocols specification. We discuss design and implementation considerations for the monitor, and demonstrate the monitor’s utility as a tool to facilitate waveform debugging on a range of examples taken from the literature.


Project Files

  • Paper: PDF
  • Presentation: PDF
  • Source Code: TGZ

OCEAN: Optimizing SpGEMM on GPUs through Cardinality EstimAtioN and Persistent Kernel Design

Project Image

OCEAN

Team Members

Julian Bellavita, Yifan Li, Irene Simo Munoz

Abstract

Sparse general matrix multiplication (SpGEMM) is a kernel that computes the product of two sparse matrices and produces a sparse output matrix. SpGEMM parallelization presents more challenges than dense matrix multiplication due to the irregular structure of the data. Because all matrices involved are sparse, i.e. they contain only a small fraction of nonzero entries, SpGEMM typically exhibits irregular memory access patterns. This irregularity complicates efficient workload distribution and introduces significant overhead, making the kernel memory-bound and lowering its arithmetic intensity. This work addresses these challenges by developing a cache-aware SpGEMM framework explicitly tailored to modern GPU architectures in two different aspects: (1) exploring new algorithms, such as the HyperBitBit probabilistic data structure, to reduce unnecessary symbolic computation and accelerate execution, and (2) designing a persistent kernel (a single GPU kernel) that handles all communication and computation and thus can efficiently compute SpGEMM in a distributed memory environment.


Project Files

  • Paper: PDF
  • Presentation: PDF
  • Source Code: TGZ

Spectral Stability in Lipschitz Transformers Improves Quantization Robustness

Abstract

Neural network quantization reduces memory by representing weights in low-precision formats, but often degrades performance—especially at aggressive compression levels. We investigate whether Lipschitz-constrained transformers, trained with spectral norm regularization, exhibit improved quantization robustness. Through experiments on a 10M parameter character-level language model, we discover that Lipschitz constraints improve quantization not by suppressing weight outliers, but by preserving spectral structure during quantization. At 2-bit precision, Lipschitz models maintain near-zero spectral norm change (Δσ ≈ 0) while baseline models show significant degradation (Δσ = +0.069). This translates to 5× better loss preservation: Lipschitz models increase validation loss by only 0.08 nats versus 0.41 nats for baseline models.


Project Files


Overview

This project explores a training-time approach to quantization robustness. Rather than developing new quantization algorithms, we investigate whether training transformers with enforced spectral norm constraints naturally produces quantization-friendly weights. This allows flexible post-training precision selection without specialized quantization methods.

Key Finding: Spectral Stability > Outlier Suppression

Conventional wisdom suggests that quantization difficulty stems from weight outliers—extreme parameter values that force quantization grids to span wide ranges. Our results challenge this assumption. We find that Lipschitz models actually have larger outliers (14.0 vs 9.5 at 2-bit) yet demonstrate superior performance. The true mechanism is spectral stability: preservation of weight matrix spectral structure under quantization.


Experimental Results

1. Compression Ratios Are Equal

Compression Ratio vs Quantization Level

Both Lipschitz-constrained and baseline models achieve identical compression ratios across bit-widths. At 2-bit, both achieve ~16× compression; at 8-bit, ~4× (from a 32-bit baseline). This confirms the quantization scheme affects both models equally in terms of storage savings—any performance differences must stem from geometric properties of the weights, not compression efficiency.

2. Lipschitz Models Preserve Performance Under Aggressive Quantization

Loss Change vs Quantization Level

This figure reveals the core result: 5× improvement in robustness at 2-bit precision. The Lipschitz model increases loss by only 0.08 nats while the baseline increases by 0.41 nats. At 4-bit and above, both models show minimal degradation (<0.01 nats), confirming that 4-bit is sufficient regardless of training procedure. The Lipschitz advantage emerges specifically at aggressive compression (2-bit), where geometric effects become dominant.

3. The Mechanism: Spectral Structure Preservation

Spectral Norm Change vs Quantization

This is the key mechanistic insight. At 2-bit quantization, the Lipschitz model achieves Δσ ≈ 0 (spectral norm preserved), while the baseline shows Δσ = +0.069 (~7% increase). For a linear layer y = Wx, the spectral norm σ_max(W) determines maximum signal amplification. If σ_max(Ŵ) ≈ σ_max(W) after quantization W → Ŵ, the layer’s geometric transformation is preserved despite parameter perturbations. Spectral degradation means the layer amplifies inputs differently than trained, corrupting the learned function.

Why does this matter? The condition number κ = σ_max/σ_min governs quantization error (error scales as κ²). Lipschitz training not only creates low κ initially through bounded spectral norms, but critically κ remains stable after quantization because σ_max is preserved.

4. Surprising Result: Outliers Don’t Determine Robustness

Weight Outliers vs Quantization

This contradicts conventional wisdom. At 2-bit, the Lipschitz model exhibits maximum absolute weight value of 14.0 versus 9.5 for the baseline—47% larger outliers. Yet simultaneously, the Lipschitz model demonstrates 5× better performance. This reveals that spectral structure may matter more than individual parameter magnitudes.

Mathematical insight: Spectral norm bounds individual entries loosely. A rank-1 matrix W = uv^T with ||u|| = ||v|| = 1 has σ_max = 1, but entries W_ij = u_i v_j can approach 1. Two matrices can have identical spectral norms but very different outlier patterns. If both preserve their spectra under quantization, both preserve their learned functions—regardless of outlier magnitude differences.
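This point can be checked numerically. The following NumPy sketch (our own illustration, not from the project code) contrasts a rank-1 matrix concentrated on a single entry with a random orthogonal matrix: both have spectral norm 1, yet their largest entries differ substantially.

```python
import numpy as np

n = 64

# Rank-1 matrix u v^T with ||u|| = ||v|| = 1: sigma_max = 1, one entry equals 1.
u = np.zeros(n); u[0] = 1.0
v = np.zeros(n); v[0] = 1.0
W_spiky = np.outer(u, v)

# Random orthogonal matrix: sigma_max = 1 too, but entries are spread (~1/sqrt(n)).
Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(n, n)))

for name, W in [("rank-1", W_spiky), ("orthogonal", Q)]:
    # ord=2 on a 2-D array returns the largest singular value.
    print(f"{name}: sigma_max = {np.linalg.norm(W, 2):.3f}, "
          f"max |entry| = {np.abs(W).max():.3f}")
```

Both matrices report sigma_max = 1.000, while the rank-1 matrix's largest entry is 1.0 versus roughly 1/√n-scale entries for the orthogonal one, illustrating how spectral norm bounds outliers only loosely.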


Technical Approach

Lipschitz-Constrained Architecture

Building on recent work (Newhouse et al., 2025), we use:

  • Muon optimizer: Orthogonalizes gradient updates via Newton-Schulz iteration, producing bounded spectral norm changes
  • Spectral soft capping: Polynomial approximation of σ → min(σ_max, σ) applied to all singular values, maintaining strict bounds throughout training
  • Modified attention: 1/d scaling (instead of 1/√d) for bounded Lipschitz constant
  • Convex residual connections: (N-1)/N · x + 1/N · block(x) to prevent exponential depth scaling
  • No normalization layers: Removed LayerNorm to maintain Lipschitz continuity

Quantization Procedure

Simple symmetric quantization (no calibration, no mixed precision, no specialized algorithms):

W_q = clip(round(W / s), -q_max, q_max)

where scale s = max(|W|) / q_max for per-tensor, or s_i per output channel for per-channel.

Storage and computation paradigm:

  • Weights stored in low-precision (int8, int4, etc.) for memory savings
  • Dequantized to bfloat16 before computation: Ŵ = (s · W_q)_bfloat16
  • Provides memory benefits (larger batches, reduced storage) but minimal speed improvements
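The procedure above amounts to only a few lines. A NumPy sketch of the per-tensor variant (our own illustration; it dequantizes to float32 rather than bfloat16 for simplicity):

```python
import numpy as np

def quantize_dequantize(W, bits):
    """Symmetric per-tensor quantization W_q = clip(round(W/s), -q_max, q_max),
    then dequantize W_hat = s * W_q for computation (storage-and-dequantize)."""
    q_max = 2 ** (bits - 1) - 1           # e.g. 127 for 8-bit, 1 for 2-bit
    s = np.abs(W).max() / q_max           # per-tensor scale from the largest weight
    W_q = np.clip(np.round(W / s), -q_max, q_max)
    return s * W_q

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128)).astype(np.float32)
for bits in (8, 4, 2):
    W_hat = quantize_dequantize(W, bits)
    d_sigma = np.linalg.norm(W_hat, 2) - np.linalg.norm(W, 2)
    print(f"{bits}-bit: d_sigma = {d_sigma:+.4f}")
```

The per-channel variant replaces the single scale s with one scale per output channel, tightening the grid for rows with smaller magnitudes.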

Key Contributions

  1. First empirical connection between training-time spectral regularization and quantization robustness for transformers

  2. Mechanistic insight: Spectral stability (Δσ ≈ 0) under aggressive quantization, while baseline models show spectral degradation

  3. Paradigm challenge: Lipschitz models have larger outliers yet better performance, revealing spectral structure preservation as the key mechanism rather than outlier suppression

  4. Practical approach: Training-time constraints enable flexible post-training quantization without specialized algorithms—train once with spectral constraints, then select precision based on deployment needs


Implications

Training-Time vs Post-Training Quantization

Our approach occupies middle ground between QAT (quantization-aware training) and PTQ (post-training quantization). We make architectural and optimization decisions during training (Lipschitz constraints), but can then flexibly quantize to various precisions after training without retraining. This differs from:

  • QAT: Requires expensive retraining for each target precision
  • PTQ: Operates on fixed pre-trained weights, often with larger accuracy drops
  • Post-hoc spectral normalization: Simply normalizing weights after training doesn’t create the stable spectral properties we observe

When Do Lipschitz Constraints Help?

  • Aggressive compression (2-bit): 5× robustness improvement
  • Moderate compression (4-bit+): Minimal difference—4-bit sufficient regardless of training
  • Small models: Our 10M parameter model requires extreme quantization to show effects; larger models may reveal benefits at higher bit-widths
  • Memory-constrained deployments: Most valuable when 2-3 bit weights are necessary

Limitations

  • Simple quantization only: We use basic symmetric quantization to isolate spectral effects. Combining with sophisticated methods (mixed precision, learned codebooks) could yield additional benefits
  • Memory vs speed: Provides storage savings but minimal computational speedup (since we dequantize to bfloat16). Future work: kernels that compute directly in low precision
  • Storage-and-dequantize paradigm: Results specific to this approach; methods computing in quantized precision may show different trade-offs

Exploiting Gradient Locality for Efficient LLM Fine-Tuning

Project Image

Team Members

Jamal Hashim (working with Alicia Yang and Shouxu Lin, advised by Ken Birman and Chris De Sa)

Abstract

Large language model (LLM) fine-tuning is often constrained by memory limitations, requiring expensive multi GPU setups even for moderate-scale jobs. We observe that gradient magnitudes in LLM fine-tuning follow a power law distribution, with the top 1% of gradients accounting for 70+% of the total gradient magnitude. These "hot" gradients exhibit strong spatial locality, concentrating in specific rows and columns of weight matrices in the QKV and MLP layers of LLMs. They also demonstrate strong temporal locality with the same set of gradients staying "hot" throughout a full training run. In this paper, we present a system that exploits this gradient locality to reduce memory consumption during fine-tuning by selectively computing and storing gradients only for "hot" parameters. Our approach profiles gradient magnitudes for 5% of training iterations, then either maintains a fixed set of hot parameters (for simple tasks) or employs an exponentially weighted hot-swapping strategy (for complex tasks) that dynamically rotates parameters through GPU memory. By eliminating gradient computation and optimizer state storage for cold parameters, our method reduces memory usage to approximately 40% of standard full fine-tuning while nearly maintaining full accuracy. This work enables full fine-tuning on memory-constrained hardware and provides an alternative to parameter-efficient methods for tasks where full fine-tuning demonstrates superior performance.


Project Files

  • Paper: PDF
  • Presentation: PDF
  • Source Code: TGZ

Use of Automated Testing Framework for KubestellarUI

Project Image

Architecture of the LLM set-up for log analysis.

Team Members

Jorge Tapias Gomez

Abstract

Recording runtime status in system logs is a very common strategy that most computer systems use to identify and resolve malfunctions in a timely manner. However, manually detecting anomalies in logs is time-consuming and error-prone, making it infeasible in most cases. Given that these logs are an excellent source of information for monitoring, and that they use a subset of human language to describe the runtime, we propose to automatically learn log patterns from normal execution and detect anomalies using LLMs. Unfortunately, such models are data-hungry in a field where collecting data is hard and time-consuming; furthermore, different systems log things differently. Given the recent successes and impressive adaptability of LLMs, we explore their out-of-distribution performance on the publicly available loghub dataset to build a benchmark of readily available LLMs. Furthermore, we explore whether it is important to create system-specific AI methods for log analysis or whether already available methods are good enough. Experimental evaluations show that modern LLMs, even without task-specific fine-tuning, achieve strong performance, reaching F1 scores of 0.85 and 0.857 on the BGL and Thunderbird datasets, respectively, although they remain slightly below the performance of fine-tuned state-of-the-art LLM-based approaches.


Project Files


Project Title

Project Image

Team Members

Jacqueline Wen

Abstract

In this research project, we will leverage existing automated unit-test generation frameworks (such as ByteDance's nxt_unit) to automatically generate unit tests for KubestellarUI. Since KubestellarUI interacts closely with the underlying KubeStellar core by retrieving and displaying data, errors discovered through automated testing of the UI may also reveal errors in the core system itself. In this way, automated testing serves not only as a tool for improving the reliability of the interface, but also as a mechanism for uncovering systemic issues that might otherwise remain hidden.


Project Files

  • Paper: PDF
  • Presentation: PDF
  • Source Code: TGZ

Investigating the Causes of Communication Bottlenecks in Distributed ML Training

Project Image

Team Members

Lindsey Bowen

Abstract

The scale of modern large language models (LLMs) necessitates memory usage that far exceeds the capacity of any single GPU, requiring training and inference to be distributed across thousands of GPUs. However, current parallelization strategies, such as tensor, pipeline, and data parallelism, introduce frequent synchronization points that require all participating GPUs to reach the same point before continuing to the next training step. Consequently, this limits overall performance to that of the slowest GPU in the group, creating what is known as the "straggler effect". The goal of this work is to identify the leading causes of the amplification of the straggler effect amongst homogeneous GPUs, and to propose possible mitigating solutions based on the experimental results.


Project Files


Traloc: Speeding up Distributed Transactions through Transactional Locality with Application-Assisted Sharding

Project Image

Team Members

Muhammad Ahmed

Abstract

This paper presents a hybrid solution to exploiting transactional locality: a programming model that captures application-level causality through an explicit API. Developers annotate transactions as binding and unbinding, signaling the start and end of semantically coherent access patterns between two or more entities. I argue that in many applications, once two or more entities bind together, they will engage in only a limited set of possible transactions, making it possible to preemptively colocate relevant data before the workload materializes. Such lightweight annotations enable significant improvements in transaction latency and throughput, especially under workloads with mixed or non-uniform locality, such as cellular handover workloads and popular P2P marketplaces like Uber. This approach enables systems to move from merely observing locality to proactively capturing its cause, improving adaptability, performance, and semantic alignment with the application.


Project Files

  • Paper: PDF
  • Presentation: PDF
  • Source Code: TGZ

Ursa: End-to-End Multicluster Test Framework

Ursa_Image

Automated End to End Testing Framework For Multi-Cluster Orchestration

Team Members

Munachimso Nwaiwu

Abstract

The AI boom is driving a massive demand for computing power, leading to a rapid expansion of data centers. To handle this scale, we are moving toward Multi-Cluster Orchestration platforms (like KubeStellar, Slurm) to manage workloads across multiple clusters. We can no longer manage these clusters independently; we need software to do it. However, if this orchestration software fails, critical AI workloads crash. Currently, developers struggle to verify that these platforms are reliable. Manual testing is too slow, testing in production is too risky, and current integration tests lack scope. Even existing automated tests are limited because they typically require developers to manually define test inputs. This is usually insufficient for robust testing, as these inputs often fail to cover the full scope of system capabilities. They tend to focus on “happy paths” rather than exhaustive scenarios, failing to catch the messy, complex bugs (like partial updates or “Zombie States”) or edge cases that break systems in the real world.

We introduce Ursa, a fully automated validation framework designed to close this gap. Unlike traditional tools, Ursa utilizes Input Space Partitioning (ISP) to systematically model and generate diverse test scenarios. It employs a novel Paired-Sequencing Algorithm to test complex lifecycle transitions (Create - Update - Delete) and validates the outcomes using a Policy Ledger Oracle. This oracle maintains a “Shadow State” of the fleet, enabling the detection of subtle failures (e.g., partial updates, drift, and blast radius violations) that standard status checks miss. We demonstrate that Ursa can automatically detect state bugs that would otherwise go unnoticed. This framework lays the foundation for ensuring our multi-cluster infrastructure is resilient enough to handle the next generation of AI applications. Furthermore, this work serves as a fundamental first step toward building a testing framework capable of fault injection.


Project Files

