Date: February 13, 2026
Title: Virtual Decoupled Cores: A Composable Runtime for Modern Async GPUs
Speaker: Zhiyuan Guo

Abstract: Modern GPUs expose increasing internal parallelism, including independent tensor and memory pipelines and asynchronous copy engines. Yet GPU kernels are still written as single SIMT programs that interleave memory movement, computation, and control. This forces developers and compilers to linearize fundamentally concurrent activities, making overlap brittle, non-composable, and tightly coupled to specific architectures.
We present virtual Decoupled Cores (vDC), a programming and execution model that separates memory, compute, and control and reconnects them only through explicit dependencies. vDC virtualizes warps into software-defined async memory and compute cores that communicate via queues and ports. This structure allows the compiler and runtime to schedule prefetching, buffering, and computation–communication overlap safely and automatically, turning what are today hand-tuned, architecture-specific optimizations into emergent behavior.
Beyond performance, vDC provides a substrate for systematic composability. Kernels become modular components that can be plugged into larger pipelines, enabling runtime scheduling changes without pre-launch fusion or kernel rewriting. In our evaluation of Qwen-8B inference, vDC enables roughly 90% kernel reuse across variants and delivers up to an 8.4% performance improvement over state-of-the-art fused-kernel systems.
Bio: Zhiyuan is a final-year Ph.D. student in Computer Science at the University of California, San Diego. He is a member of the WukLab and SysNet groups at UCSD, advised by Prof. Yiying Zhang. His work focuses on building efficient, performant, and scalable next-generation datacenter systems through resource disaggregation and co-design of the application, software, and hardware stacks. Zhiyuan's research interests span Operating Systems, Distributed Systems, Computer Architecture, and Programming Languages.