U-Net: A User-Level Network Interface for Parallel and Distributed Computing

Notes by Xun Wilson Huang
01/01/02

This paper is motivated by the observation that, with the advances in high-speed LANs, processing in the end hosts has become the latency bottleneck. It aims to reduce end-host latency, and to give applications flexibility in building their own protocols, by introducing a new architecture, U-Net. The paper argues that the kernel should be removed entirely from the communication path.

In the simplest sense, all U-Net provides to the user is an interface to the device driver and a buffer shared between user space and the kernel. The user is expected to manage this buffer on his own when building protocol stacks.

The latency of a network stack consists of the following components:

Context switches. No design can avoid these entirely: U-Net still needs system calls for channel creation and teardown, for informing the kernel that data in the shared buffer is ready to be sent, and for making upcalls when data arrives.
Data copies between user space and the kernel (the copyin() and copyout() functions). U-Net avoids these by providing a shared buffer allocated by the kernel and restricting the user to this buffer only.
Protocol processing by the different layers. This overhead comes with the functionality the protocols provide, and it cannot be avoided whether the protocol runs in user space or kernel space.
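
As a rough model (a simplification adopted in these notes, not a formula from the paper), the host-side latency of one message of n bytes is the sum of the three components above. The copy term is assumed to grow with the number of payload copies k and shrink with the memory-copy bandwidth B:

```latex
T_{\mathrm{host}}(n) \approx T_{\mathrm{ctx}} + \frac{k\,n}{B} + T_{\mathrm{proto}}(n)
```

With the copy counts from the zero-copy discussion, a conventional send() has k = 2, U-Net's shared buffer gives k = 1, and direct-access hardware such as the SBA-200 gives k = 0. Under this model only the middle term changes, so the saving from removing copies grows linearly with n while T_ctx and T_proto are untouched.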

How does U-Net work?

The user application makes a system call to create an endpoint, which serves as its handle to the network.
The application sets up channels to demultiplex packets destined for the same endpoint.
Along with the endpoint, the application gets a buffer that is shared between user space and the kernel.
The application composes the data it wishes to send in that buffer area, composes a descriptor for that data segment, and pushes the descriptor onto the tx queue.
The application traps into the kernel to signal that something is in the tx queue.
On the receive side, data can only arrive in the buffer that came with the endpoint. The application can either poll the rx queue or register an upcall with the kernel module for asynchronous notification.
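
The steps above can be sketched as a toy in-process simulation. Everything here is hypothetical and chosen for illustration: the Descriptor fields, the Endpoint and Kernel classes, the trap_send() method, and tag 7. A single Kernel object stands in for the kernel module, the network interface, and the wire; it is not U-Net's real interface.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    tag: int     # channel tag, used by the receiver to demultiplex
    offset: int  # where the payload starts in the communication segment
    length: int  # payload length in bytes


class Endpoint:
    """Toy U-Net endpoint: a shared segment plus tx and rx descriptor queues."""

    def __init__(self, segment_size):
        self.segment = bytearray(segment_size)  # buffer shared with the "kernel"
        self.tx_queue = deque()                 # descriptors waiting to be sent
        self.rx_queue = deque()                 # descriptors of arrived messages
        self.upcall = None                      # optional asynchronous notification
        self._next = 0                          # bump allocator; space is never reused

    def compose(self, tag, payload):
        """Write payload directly into the segment and return a descriptor for it."""
        if self._next + len(payload) > len(self.segment):
            raise MemoryError("communication segment exhausted")
        start = self._next
        self.segment[start:start + len(payload)] = payload
        self._next += len(payload)
        return Descriptor(tag, start, len(payload))

    def poll(self):
        """Polling receive: next message as bytes, or None if the rx queue is empty."""
        if not self.rx_queue:
            return None
        d = self.rx_queue.popleft()
        return bytes(self.segment[d.offset:d.offset + d.length])


class Kernel:
    """Stands in for the kernel module, the NI, and the wire between endpoints."""

    def __init__(self):
        self.channels = {}  # tag -> destination endpoint

    def create_channel(self, tag, dst):
        self.channels[tag] = dst

    def trap_send(self, src):
        """The send trap: drain src's tx queue and deliver each message by its tag."""
        while src.tx_queue:
            d = src.tx_queue.popleft()
            dst = self.channels[d.tag]
            payload = src.segment[d.offset:d.offset + d.length]
            rx = dst.compose(d.tag, payload)  # data lands only in dst's own segment
            dst.rx_queue.append(rx)
            if dst.upcall is not None:
                dst.upcall(rx)


kernel = Kernel()
a, b = Endpoint(4096), Endpoint(4096)
kernel.create_channel(7, b)                # channel tag 7 leads to endpoint b
notified = []
b.upcall = notified.append                 # register an upcall on b
a.tx_queue.append(a.compose(7, b"hello"))  # compose in place, push the descriptor
kernel.trap_send(a)                        # trap to signal the tx queue is non-empty
msg = b.poll()                             # b could also just poll without an upcall
print(msg)  # b'hello'
```

The upcall here only notifies; the message still waits in the rx queue. This mirrors the choice between polling and asynchronous notification described in the last step.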

Zero-copy

In a conventional operating system, a send() normally involves two copies:

Copying from user space to kernel space.
Copying from kernel to the device's buffer.
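
A minimal sketch that makes the two copies countable. CopyCounter, conventional_send(), and shared_segment_send() are hypothetical names, and Python byte arrays stand in for user, kernel, and device memory. The U-Net path assumes the application already produced its data inside the shared segment.

```python
class CopyCounter:
    """memcpy stand-in that counts how many times payload bytes are moved."""

    def __init__(self):
        self.count = 0

    def copy(self, dst, src):
        dst[:len(src)] = src
        self.count += 1


def conventional_send(mem, user_buf, device_buf):
    kernel_buf = bytearray(len(user_buf))
    mem.copy(kernel_buf, user_buf)    # copy 1: user space -> kernel (copyin)
    mem.copy(device_buf, kernel_buf)  # copy 2: kernel buffer -> device buffer


def shared_segment_send(mem, segment, length, device_buf):
    # The payload already sits in the kernel-visible shared segment, so the
    # user -> kernel copy disappears; only segment -> device remains.
    mem.copy(device_buf, memoryview(segment)[:length])


payload = b"x" * 1024
device = bytearray(2048)

traditional = CopyCounter()
conventional_send(traditional, payload, device)

segment = bytearray(4096)
segment[:len(payload)] = payload  # stands in for the app producing data in place
unet = CopyCounter()
shared_segment_send(unet, segment, len(payload), device)

print(traditional.count, unet.count)  # 2 1
```

Removing the remaining copy would require the device to read the segment directly (for example by DMA), which is the hardware support that the next paragraph says cannot be assumed everywhere.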

The second copy cannot be avoided without hardware support, such as the SBA-200's firmware. Therefore this trick from direct-access U-Net is not universally applicable.

U-Net avoids the first copy by restricting the user to the buffer that U-Net provides at endpoint creation. This restriction is very inconvenient for the user application, and it leads to a later paper, "U-Net with buffer management", which pins user memory down as send requests come in. However, what does send() in TCP really mean in this zero-copy architecture? In a traditional OS, the return of "TCP send(buf, ...)" means the content of buf has been copied into the OS's buffer: buf can be reused immediately, and its content will be sent eventually (assuming nothing abnormal occurs). In U-Net's zero-copy structure, however, the buffer cannot be reused until an ack is received, so the application either has to obtain another buffer through its own buffer management or simply block for the ack. For UDP or RPC this works quite well, but for TCP a copy is still required.
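
To make the reuse problem concrete, here is a minimal sketch of user-level buffer management for a zero-copy, TCP-like sender. The SendBufferPool class and its methods are hypothetical, and acks are simplified to cumulative segment numbers rather than TCP byte sequence numbers. A buffer handed to the network stays unavailable until an ack covers it.

```python
from collections import deque


class SendBufferPool:
    """Toy user-level send-buffer pool for a zero-copy, TCP-like sender."""

    def __init__(self, count, size):
        self.buffers = [bytearray(size) for _ in range(count)]
        self.free = deque(range(count))  # indices of buffers the app may write into
        self.in_flight = {}              # segment number -> buffer index awaiting ack

    def acquire(self):
        """Return a free buffer index, or None if every buffer is awaiting an ack."""
        return self.free.popleft() if self.free else None

    def send(self, idx, seq):
        """Hand buffer idx to the network; it stays pinned until seq is acked."""
        self.in_flight[seq] = idx

    def on_ack(self, ack):
        """Cumulative ack (simplified): every segment with seq < ack is done."""
        for seq in sorted(s for s in self.in_flight if s < ack):
            self.free.append(self.in_flight.pop(seq))


pool = SendBufferPool(count=2, size=1500)
first = pool.acquire()
pool.send(first, seq=0)
second = pool.acquire()
pool.send(second, seq=1)
blocked = pool.acquire()  # None: both buffers still belong to unacked segments
pool.on_ack(1)            # acknowledges segment 0 only
reused = pool.acquire()   # the buffer of segment 0 is writable again
print(blocked, reused == first)  # None True
```

When acquire() returns None, the application faces exactly the two options named above: block until on_ack() frees a buffer, or allocate more buffers. A conventional send() never reaches this state, because it copies and returns.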

Critiques and questions:

"remove kernel completely from the critical path"? What about system-wide resources that need to be shared among different applications, such as the TCP/UDP port number space, the ARP table, or the routing table? Where should these be kept?
Flexibility. To build new communication protocols, one can use raw sockets together with a mechanism for asking the kernel to allocate buffers in the kernel. To customize protocols, I think it is better to have the OS provide a protocol-layering mechanism that allows easy insertion and bypass of layers, rather than having the entire (customized) protocol stack live in user space. For mortals like me, moving the protocol stack up to user space is not easy.
For easy development and debugging of network protocols, one can consider SurReal instead, the new network simulator currently being developed at Cornell.
Tagging the packet. U-Net uses tags to demultiplex incoming packets to different endpoints. This makes it impossible for U-Net to communicate with a regular stack, which motivates the later work on packet filtering.
This paper does not address how machines running U-Net are identified. In the TCP implementation section, it says "module not in the critical performance path such as ARP are not ported to U-Net". 1. Looking up an ARP entry is in the critical path. 2. Without ARP, does U-Net identify machines by their hardware addresses?
This paper also argues that having the stack in user space allows easy querying of network parameters for feedback. This type of query can also be done in a traditional OS with getsockopt() and ioctl().
Having the stack in user space also allows the user application to corrupt the stack. The paper does not address any queuing policy: if FCFS is used, the corruption of one stack can affect everyone else.
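
For the tagging critique, a toy contrast between tag-based and filter-based demultiplexing may help. Packets are modelled as Python dicts, and the tag value 7, port 5000, and endpoint names are hypothetical. The filter is a generic list of predicates over header fields, not the API of any particular packet-filter system. A packet from a regular stack carries no tag, so only the filter can place it.

```python
def demux_by_tag(tag_table, packet):
    """U-Net style: the endpoint is found only via the packet's channel tag."""
    return tag_table.get(packet.get("tag"))


def demux_by_filter(filters, packet):
    """Filter style: the first predicate over ordinary header fields wins."""
    for matches, endpoint in filters:
        if matches(packet):
            return endpoint
    return None


tag_table = {7: "endpoint-A"}
unet_packet = {"tag": 7, "payload": b"hi"}
# Traffic from a regular UDP/IP stack carries no U-Net tag.
udp_packet = {"proto": "udp", "dst_port": 5000, "payload": b"hi"}
filters = [(lambda p: p.get("proto") == "udp" and p.get("dst_port") == 5000, "endpoint-B")]

print(demux_by_tag(tag_table, unet_packet))  # endpoint-A
print(demux_by_tag(tag_table, udp_packet))   # None
print(demux_by_filter(filters, udp_packet))  # endpoint-B
```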
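
On the point about querying network parameters: in a traditional stack, that feedback is available through getsockopt(). The sketch below uses Python's socket module, which wraps the same system calls; the options shown (TCP_NODELAY, SO_RCVBUF, SO_SNDBUF) are standard. The printed values depend on the OS and its configuration, so no expected output is given.

```python
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    # Tune one parameter, then read current values back from the kernel stack.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    nodelay = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
    rcvbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    sndbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)

print(f"TCP_NODELAY={nodelay} SO_RCVBUF={rcvbuf} SO_SNDBUF={sndbuf}")
```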
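
On the missing queuing policy: the toy comparison below shows why FCFS is fragile when one user-level stack misbehaves. It is an illustration chosen for these notes, not U-Net's scheduler. Endpoint names and message counts are hypothetical, and round robin is used only as one familiar alternative. Under FCFS, a stack that floods its tx queue takes every transmit slot. Under round robin, the well-behaved endpoint still gets a share.

```python
from collections import deque


def fcfs(arrivals, slots):
    """Transmit strictly in arrival order across all endpoints."""
    return [endpoint for endpoint, _ in arrivals[:slots]]


def round_robin(arrivals, slots):
    """Give each endpoint with pending messages one transmit slot per round."""
    queues = {}
    for endpoint, msg in arrivals:
        queues.setdefault(endpoint, deque()).append(msg)
    order = []
    while len(order) < slots and any(queues.values()):
        for endpoint, q in queues.items():
            if q and len(order) < slots:
                q.popleft()
                order.append(endpoint)
    return order


# A buggy or corrupted stack floods its tx queue before a well-behaved one sends.
arrivals = [("runaway", i) for i in range(8)] + [("polite", 0), ("polite", 1)]
print(fcfs(arrivals, 4))         # ['runaway', 'runaway', 'runaway', 'runaway']
print(round_robin(arrivals, 4))  # ['runaway', 'polite', 'runaway', 'polite']
```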