Abstract: Traditional operating system interfaces and network protocol implementations force system state to be kept on both sides of a connection. Such state ties the connection to an endpoint, impedes transparent failover, permits denial-of-service attacks, and limits scalability. This paper introduces a novel TCP-like transport protocol and a new interface to replace sockets that together enable all state to be kept on one endpoint, allowing the other endpoint, typically the server, to operate without any per-connection state. Called Trickles, this approach enables servers to scale well with increasing numbers of clients, consume fewer resources, and better resist denial-of-service attacks. Measurements on a full implementation in Linux indicate that Trickles achieves performance comparable to TCP/IP, interacts well with other flows, and scales well. Trickles also enables qualitatively different kinds of networked services. Services can be geographically replicated and contacted through an anycast primitive for improved availability and performance. Widely-deployed practices that currently have client-observable side effects, such as periodic server reboots, connection redirection, and failover, can be made transparent, and perform well, under Trickles. The protocol is secure against tampering and replay attacks, and the client interface is backwards-compatible, requiring no changes to sockets-based client applications.
Figure 2 depicts the exchange of packets between the two ends of a typical TCP or Trickles connection. For simplicity, there is no loss, packets arrive in order, and delayed acknowledgments are not used. Except where the current window size (cwnd) increases (at times A and B), the receipt of one packet from the client enables the server to send one packet in response, which in turn triggers another packet from the client, and so on. This sequence of related packets forms a trickle.
The round-trip delay in state updates makes it challenging to match the congestion control action of TCP. Trickles circumvents the delay by using prediction. When a packet arrives at the server, the server can only know about packet losses that happened one full window earlier. It optimistically assumes that all packets since that point have arrived successfully, and accordingly makes the decision to continue, split, or terminate. Optimism makes the common case of infrequent packet loss work well.
With any prediction scheme, it is sometimes necessary to recover from misprediction. Suppose a packet is lost before it reaches the server. Then the server does not generate the corresponding response packet. This situation is indistinguishable from a loss of the response on the server to client path: in both cases, the client receives no response (Figure 4). Consequently, a recovery mechanism for response losses also suffices to recover from request packet losses, simplifying the protocol. Note, however, that Trickles is more sensitive to loss than TCP. While TCP can elide some ACK losses with implicit acknowledgments, such losses in Trickles require retransmission of the corresponding request and data.![]()
Figure 4: Equivalence of reverse and forward path loss in Trickles. Due to dataflow constraints, the packet following a lost packet does not compensate for the loss immediately. Neither the server nor the client can distinguish between (A) and (B). The loss will be discovered through subsequent SACK proofs.
The TCPCwnd(k) formula is directly valid only for connections where no losses occur. A connection with losses can be partitioned at the loss positions into multiple pieces without losses; TCPCwnd(k) is valid within each individual piece. The free parameters in TCPCwnd(k) are used to adapt the formula for each piece: startCwnd and ssthresh are initial conditions at the point of the loss, and TCPBase corresponds to the last loss location.
Figure 5: Closed-form solution of TCP simulation.
In fast retransmit/recovery, TCP Reno uses duplicate acknowledgments to infer the position of a lost packet (Figure 6). The lost packet is retransmitted, the cwnd is halved, and transmission of new data temporarily squelched to decrease the number of in-flight packets to newCwnd. Likewise, Trickles uses its SACK proof to infer the location of lost packets, retransmits these packets, halves the cwnd, and terminates a sufficient number of trickles to deflate the number of in-flight packets to newCwnd (Figure 7).![]()
Figure 6: TCP recovery. Duplicate ACKs signal recovery (A). Subsequent ACKs are ignored until number of outstanding packets drops to new cwnd. Recovery ends when client acknowledges all packets (B).
![]()
Figure 7: Trickles Recovery. First packet following loss triggers a retransmission (A). Trickles are subsequently terminated to deflate cwnd (B). Recovery ends when cwnd survivors are generated; cwnd has dropped from the original value of 5 to 2 (C).
firstLoss := sequence number of
first loss in input
cwndAtLoss := TCPCwnd(firstLoss - 1)
lossOffset := k - firstLoss
newCwnd := numInFlight / 2
The protocol variable firstLoss is derived from the SACK proof.
The SACK proofs for the trickle immediately after a loss,
as well as all subsequent trickles before recovery, will report a
gap. The SACK prefix invariant ensures
that each trickle will compute consistent values for the protocol
variables shown above.a := firstLoss ssthresh := TCPCwnd(a-1)/2 cwnd := InitialCwnd
The minisocket API is shown in Figure 8. Minisockets are represented by the structure minisock *. All minisockets share the same file descriptor (fd), that of their listen (server) socket. To send data with a minisocket, applications use msk_send(). It copies packet data to the kernel, constructs and sends Trickles response packets, then deallocates the minisocket. msk_setucont() allows the application to install user continuations on a per-packet basis. Trickles also provides scatter-gather, zero copy, and packet batch processing interfaces.msk_send(int fd, minisock *, char *, size_t); msk_sendv(int fd, minisock *, tiovec *, int); msk_sendfilev(int fd, minisock *, fiovec *, int); msk_setucont(int fd, minisock *, int pkt, char* buf, size_t); msk_sendbulk(int fd, mskdesc *, int len); msk_drop(int fd, minisock *); msk_detach(int fd, minisock *); msk_extract_events(int fd, extract_mskdesc_in *, int inlen, msk_collection *, int *outlen); msk_install_events(int fd, msk_collection *, int); msk_request(int fd, char *req, int reqlen, int reservelen);
Figure 8: The minisocket API.
Figure 9 shows that the aggregate throughput achieved by Trickles is within 10% of TCP at all client counts. Regular TCP consumes memory separately for each connection to buffer outgoing data until it is acknowledged. Beyond 6000 clients, TCP exhausts memory, forcing the kernel to kill the server process. In contrast, the Trickles kernel does not retain outgoing data, and recomputes lost packets as necessary from the original source. Consequently, it does not suffer from a memory bottleneck.
We next examine the memory and CPU utilization of the Trickles protocol. For this experiment, we eliminated the bottleneck link in the network and connected the clients to the server through the full 1Gb/sec link to pose a worst-case scenario.
Figure 12 shows a breakdown of the CPU overhead for Trickles and TCP on a 1 Gb/s link when Trickles is reconstructing its state for every packet (i.e. soft-state caching is turned off). Not surprisingly, Trickles has higher CPU utilization than TCP, since it verifies and recomputes state that it does not keep locally. The overhead is evenly split between the cryptographic operations required for verification and the packet processing required to simulate the TCP engine. While the Trickles CPU overhead is higher, it does not pose a server bottleneck even at gigabit speeds.
We also verified that issuing continuation requests in parallel improves performance. We added the msk_request() interface that takes application-specified data and reliably transmits the data to the server for conversion into an output continuation. These requests are non-blocking, and multiple such requests can be pending at any time. In Figure 14, the object sizes are small, so a Trickles client using SKIP with the sockets interface cannot receive output continuations quickly enough to fill the link. The Trickles client supporting parallel requests can receive continuations more frequently, resulting in performance comparable to TCP.
We validated Trickles under real Internet conditions using PlanetLab [5]. We ran a variant of the throughput experiment in which both the server and the client were located in our local cluster, but with all traffic between the two nodes redirected (bounced) through a single PlanetLab node m. Packets are first sent from the source node to m, then from m to the destination node. Thus, packets incur twice the underlying RTT to PlanetLab.
Trickles enables connections to fail over from a failed server to a live backup simply through a network-level redirection. If network conditions do not change significantly during the failover to invalidate the protocol parameters captured in the continuation, a server replica can resume packet processing transparently and seamlessly. In contrast, TCP recovery from server failure fundamentally requires several out of band operations. TCP needs to detect the disconnection, re-establish the connection with another server, and then ramp back up to the original data rate.
Figure 17 compares the Jain's fairness index[14] of the total throughput versus the uniform allocation. For most data points, Trickles more closely matches the uniform distribution than TCP does.