Weijia Song    宋维佳

I am the creator of Cascade, a scalable, low-latency AI/ML application platform that efficiently combines data and computation. My research focuses on high-performance cloud application platforms. I'm fascinated with pushing system performance to the limits of the hardware. Collaborators are welcome!

Current Research Projects

Cascade/Derecho/DCCL
Cascade is a high-performance AI/ML application framework powered by optimized RDMA data paths. It exposes a standard K/V API compatible with existing platforms. It offers novel features (temporal data indexing, versioning, consistency, replication, and fault tolerance) but does so through the existing APIs, yielding mechanisms that are either fully transparent or usable with at most small application changes. Rather than delaying processing to accumulate batches of work, Cascade opportunistically collocates tasks as it schedules them. This legacy-compatible option performs well, though not optimally. For time-sensitive event-triggered pipelines, Cascade has a heavily optimized critical path that maps directly to the best modes for the hardware.
    Derecho is a supercool library for building high-performance replicated systems on RDMA networks. Today’s platforms provide surprisingly little help to developers of high-performance scalable services. To address this, Derecho automates the creation of complex application structures in which each member plays a distinct role. Derecho’s data plane runs a novel RDMA-based multicast protocol that scales exceptionally well. The Derecho control plane introduces an asynchronous monotonic logic programming model, which is used to implement strong guarantees while providing a substantial performance improvement over today’s fastest multicast and Paxos platforms. A novel subgrouping feature allows any single group to be structured into a set of subgroups, or sharded in a regular pattern. These features yield a highly flexible and expressive new programming option for the cloud. Please visit our GitHub repo for the current Derecho source code. The Derecho project is led by Kenneth P. Birman. I joined the project in late 2017.
    I also created the Derecho Collective Communications Library (DCCL), an alternative to NCCL built on the Derecho data path. DCCL outperforms OpenMPI thanks to the efficiency of that optimized data path. In particular, DCCL AllReduce is ~40% faster than OpenMPI with the Rabenseifner algorithm, and ~10% faster with the Ring algorithm. Currently, DCCL supports host memory only. GPU memory support is under construction.
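For readers unfamiliar with the Ring algorithm mentioned above, the following is a minimal single-process sketch of the textbook ring AllReduce (a reduce-scatter phase followed by an all-gather phase). It is not DCCL code and does not use the DCCL or Derecho APIs; the function name `ring_allreduce` and the in-memory "ranks" are assumptions made purely for illustration.

```python
from typing import List


def ring_allreduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a sum AllReduce over len(buffers) ranks arranged in a ring.

    Hypothetical helper for illustration only: each inner list stands in
    for one rank's local buffer, and "sending" is a plain in-memory copy.
    """
    n = len(buffers)
    length = len(buffers[0])
    if any(len(b) != length for b in buffers):
        raise ValueError("all ranks must contribute buffers of equal length")

    data = [list(b) for b in buffers]  # work on copies, leave inputs intact
    # Chunk c covers indices [bounds[c], bounds[c + 1]).
    bounds = [c * length // n for c in range(n + 1)]

    def chunk(c: int) -> range:
        return range(bounds[c], bounds[c + 1])

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) mod n
    # to its right neighbour, which adds it into its own copy of that chunk.
    for step in range(n - 1):
        # Snapshot all outgoing payloads first so sends appear simultaneous.
        sends = []
        for r in range(n):
            c = (r - step) % n
            sends.append((c, [data[r][i] for i in chunk(c)]))
        for r in range(n):
            dst = (r + 1) % n
            c, payload = sends[r]
            for i, v in zip(chunk(c), payload):
                data[dst][i] += v

    # After phase 1, rank r holds the fully reduced chunk (r + 1) mod n.
    # Phase 2: all-gather. In step s, rank r forwards chunk (r + 1 - s) mod n
    # to its right neighbour, which overwrites its stale copy.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - step) % n
            sends.append((c, [data[r][i] for i in chunk(c)]))
        for r in range(n):
            dst = (r + 1) % n
            c, payload = sends[r]
            for i, v in zip(chunk(c), payload):
                data[dst][i] = v

    return data
```

Each rank sends and receives 2(n - 1) chunks of roughly length/n elements, so per-rank traffic stays nearly constant as the ring grows. This is the usual reason the ring variant is favored for large buffers.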

CacheInspector
IaaS providers sell virtual machines that are only vaguely specified, in terms of number of CPU cores, amount of memory, NIC bandwidth, and I/O throughput. However, important details such as the use of hyperthreading, sizes of caches, and memory latency are either not clearly specified or reported in a way that makes them difficult to compare across cloud providers. As a first step toward uniform characterization of resources offered by cloud providers, we focused on modeling and measurement of CPU and memory resources. Experiments with virtual machines of various public and private IaaS clouds demonstrate results that are sometimes counterintuitive. We have begun to model and measure two other types of resources: disk I/O and network performance.
This is joint work with Robbert van Renesse, Hakim Weatherspoon, Zhiming Shen, Lotfi Benmohamed, Frederic de Vaulx, and Charif Mahmoudi.

The Freeze-Frame File System
Many applications perform real-time analysis on data streams. We argue that existing solutions are poorly matched to the need, and introduce our new Freeze-Frame File System. Freeze-Frame FS is able to accept streams of updates while satisfying “temporal reads” on demand. The system is fast and accurate: we keep all update history in a memory-mapped log, cache recently retrieved data for repeat reads, and use a hybrid of a real-time and a logical clock to respond to read requests in a manner that is both temporally precise and causally consistent. When RDMA hardware is available, the throughput of a single client reaches 2.6 GB/s for writes and 5 GB/s for reads, close to the limit of the hardware used in our experiments. Even without RDMA, Freeze-Frame FS substantially outperforms existing file system options for our target settings. To address the variety of IoT applications, we plan to extend the "Freeze-Frame" feature to more storage types, including object stores and key-value stores. Please visit our GitHub repo for the current version of Freeze-Frame FS. FFFS features are now moving to Cascade/Derecho.
This is joint work with Theodoros Gkountouvas, Kenneth P. Birman, Qi Chen, and Zhen Xiao.
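To make the idea of a "temporal read" over an append-only update log concrete, here is a toy in-memory sketch. It is not the FFFS implementation: the class `TemporalLog`, its methods, and the `(wall-clock, logical counter)` timestamp layout are assumptions chosen for illustration, loosely in the spirit of combining a real-time clock with a logical one.

```python
import bisect
from typing import Any, List, Optional, Tuple

# Hypothetical timestamp layout: (wall-clock microseconds, logical counter).
Timestamp = Tuple[int, float]


class TemporalLog:
    """Append-only history for one object, supporting reads "as of" a time."""

    def __init__(self) -> None:
        self._stamps: List[Timestamp] = []  # kept strictly increasing
        self._values: List[Any] = []

    def append(self, wall_us: int, value: Any) -> Timestamp:
        """Record an update, returning the timestamp assigned to it."""
        if self._stamps and wall_us <= self._stamps[-1][0]:
            # The wall clock did not advance (or ran backwards): reuse the
            # last wall time and bump the logical counter so that the log
            # order is preserved.
            last_wall, last_logical = self._stamps[-1]
            stamp = (last_wall, last_logical + 1)
        else:
            stamp = (wall_us, 0)
        self._stamps.append(stamp)
        self._values.append(value)
        return stamp

    def read_at(self, wall_us: int) -> Optional[Any]:
        """Return the latest value whose wall time is <= wall_us, if any."""
        # (wall_us, inf) sorts after every stamp that shares this wall time.
        idx = bisect.bisect_right(self._stamps, (wall_us, float("inf")))
        return self._values[idx - 1] if idx else None


log = TemporalLog()
log.append(100, "a")
log.append(200, "b")
log.append(150, "c")  # stale wall clock: stored as (200, 1), after "b"
```

Because the stamps stay sorted, a read at any past time is a binary search over the history instead of a scan, and no update is ever overwritten.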


Selected Publications (full publications in my CV)