Next-generation Storage I/O Stack
Two recent trends in remote storage access:
- High-performance storage devices (NVMe SSDs) and high-speed networks (40Gbps and beyond)
- Disaggregated storage
Implications?
- Performance bottlenecks move from the hardware to the software stack.
- Increasing overlap between the storage and network data planes.
Our goal: Designing a new storage I/O stack
- Achieving high throughput comparable to state-of-the-art NVMe-over-RDMA.
- Supporting multiple tenants that have different requirements.
Problems
1. Low remote block I/O throughput
What is the current status?
- Traditional iSCSI protocol requires:
- 14 CPU cores to saturate a 1M IOPS SSD.
- 56 CPU cores to saturate a 100Gbps link.
- User-level approaches require:
- Changes to applications and/or the network.
- NVMe-over-Fabrics:
- (RDMA) Requires changes to the network infrastructure.
- (TCP) Suffers from low performance.
Technical challenge: “Can we push the performance bottlenecks back to the hardware without modifying applications or networks?”
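As a back-of-the-envelope view of the first number: 1M IOPS spread over 14 cores is roughly 71K IOPS per core, i.e., about 14 µs of CPU time per request spent in protocol and block-layer processing. In a software transport such as NVMe/TCP, much of that per-request work is building and parsing protocol data units (PDUs), plus digests and data copies. The minimal C sketch below shows the 8-byte common header that prefixes every NVMe/TCP PDU (the field layout follows the transport's common header: PDU type, flags, header length, data offset, PDU length); the parsing helper around it is a hypothetical illustration, not code from this project.

    /* Sketch: the 8-byte common header that prefixes every NVMe/TCP PDU.
     * Each remote I/O involves building and parsing several such PDUs
     * (command capsule, data, response), plus optional digests and data
     * copies; this is where the per-core CPU time goes.
     * Hypothetical illustration only. */
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    struct nvme_tcp_common_hdr {
        uint8_t  type;   /* PDU type: ICReq, CapsuleCmd, C2HData, R2T, ... */
        uint8_t  flags;  /* e.g., header/data digest enabled */
        uint8_t  hlen;   /* header length */
        uint8_t  pdo;    /* PDU data offset */
        uint32_t plen;   /* total PDU length, little-endian on the wire */
    };

    /* Hypothetical helper: report how many bytes the next PDU occupies
     * in a received byte stream, or 0 if the header has not arrived yet. */
    static size_t next_pdu_len(const uint8_t *buf, size_t avail)
    {
        struct nvme_tcp_common_hdr hdr;

        if (avail < sizeof(hdr))
            return 0;
        memcpy(&hdr, buf, sizeof(hdr));
        return hdr.plen;  /* assumes a little-endian host for brevity */
    }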
2. Multi-tenancy support
How to support various tenants’ requirements?
Each host/target server can have multiple tenants with different target devices, requirements, resources, and so on.
- (As-Is) The current storage I/O stack creates per-core queues, and each I/O request takes a static data path to the target device on each core.
- (To-Be) Each request should be able to change its data path dynamically at low cost, in order to satisfy all tenants’ requirements (e.g., throughput, latency); see the sketch below.
Technical challenge: “What would be the best abstraction to support multi-tenancy at the block device layer?”
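To make the To-Be concrete, one can imagine each request carrying its tenant’s policy and the block layer choosing a data path (a local NVMe queue, an RDMA queue pair, a TCP connection, ...) at submission time, instead of inheriting the fixed per-core path. The C sketch below is purely hypothetical; all names (tenant_policy, data_path, pick_path) are invented for illustration and do not describe this project’s actual design.

    /* Hypothetical sketch of a multi-tenant block-layer abstraction:
     * every request carries its tenant's policy, and the data path is
     * chosen per request instead of being fixed per core.
     * All names are invented for illustration. */
    #include <stdint.h>
    #include <stddef.h>

    enum slo_class { SLO_LATENCY, SLO_THROUGHPUT, SLO_BEST_EFFORT };

    struct data_path {
        const char *name;   /* e.g., "local-nvme", "rdma-qp0", "tcp-conn0" */
        int  (*submit)(struct data_path *dp, const void *req);
        uint32_t inflight;  /* requests currently outstanding on this path */
    };

    struct tenant_policy {
        enum slo_class slo;     /* what this tenant cares about most */
        uint32_t max_inflight;  /* crude per-path throttle for this tenant */
    };

    struct io_request {
        uint64_t lba;
        uint32_t len;
        const struct tenant_policy *tenant;
    };

    /* Per-request path selection: latency-sensitive tenants get the
     * least-loaded path; everyone else takes the first path with
     * spare capacity under the tenant's throttle. */
    static struct data_path *pick_path(struct data_path **paths, size_t n,
                                       const struct io_request *req)
    {
        struct data_path *best = NULL;

        for (size_t i = 0; i < n; i++) {
            if (paths[i]->inflight >= req->tenant->max_inflight)
                continue;
            if (req->tenant->slo != SLO_LATENCY)
                return paths[i];
            if (!best || paths[i]->inflight < best->inflight)
                best = paths[i];
        }
        return best;
    }

The point of the sketch is only that path selection becomes a per-request, policy-driven decision; making that decision cheap enough for million-IOPS workloads is exactly the challenge posed above.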
Download
(To be updated..)