Thorsten von Eicken, Veena Avula, Anindya Basu, and Vineet Buch
Department of Computer Science
In a first part, this paper describes the differences in communication characteristics between clusters of workstations built from standard hardware and software components and state-of-the-art multiprocessors. The lack of flow control and of operating system coordination affects the communication layer design significantly and requires larger buffers at each end than on multiprocessors. A second part evaluates a prototype implementation of the low-latency Active Messages communication model on a Sun workstation cluster interconnected by an ATM network. Measurements show application-to-application latencies of about 20 microseconds for small messages which is roughly comparable to the Active Messages implementation on the Thinking Machines CM-5 multiprocessor.
In fact, the use of standard components in clusters raises the question whether these can be reasonably used for parallel processing. Recent advances in multiprocessor communication performance are principally due to a tighter integration of programming models, compilers, operating system functions, and hardware primitives. It is not clear whether these advances can be carried over to clusters or whether the use of standard components is squarely at odds with achieving the level of integration required to enable modern parallel programming models. Specifically, new communication architectures such as distributed shared memory, explicit remote memory access, and Active Messages reduced the costs from hundreds to thousands of microseconds to just a few dozen precisely through the integration of all system components. These new communication architectures are designed such that network interfaces can implement common primitives directly in hardware, they allow the operating system to be moved out of the critica This paper examines whether the techniques developed to improve communication performance in multiprocessors, in particular, Active Messages, can be carried over to clusters of workstations with standard networks and mostly standard system software. This paper assumes the current state of the art technology in which clusters using ATM networks differ from multiprocessors in three major aspects(4):
A more typical setting would be as computational resource in a distributed application. One such example, the Stormcast weather monitoring system in Norway, runs on a very large collection of machines spread across a large portion of the country, but uses a cluster of a few dozen workstations in a machine room (without high speed network in this case) to run compute-intensive weather prediction models and to emit storm warnings. The availability of low-latency communication among these workstations would enable the use of parallel programming languages and of more powerful parallel algorithms, both of which require a closer coupling among processors than is possible today.
Concentrating on the compute cluster offers the largest potential for improvement because the latency over the long-haul links is dominated by speed-of-light and network congestion issues and because the wide area communication is comparatively better served by today's distributed computing software. Note that this paper does not argue that running concurrent applications in a heterogeneous environment, across large distances, and on workstations that happen to be sitting idle is not an interesting design point (it in fact has been used successfully), but that the set of communication issues occurring in such a context cannot be compared to those in a multiprocessor.
Given that the applications for clusters considered here exhibit characteristics similar to those on multiprocessors, the programming models used would be comparable, if not identical, to those popular for parallel computing. This includes various forms of message passing (e.g., send/receive, PVM), of shared memory (e.g., cache coherent shared memory, remote reads and writes, explicit global memory), and of parallel object oriented languages (e.g., numerous C++ extensions).
On parallel machines several proposed communication architectures have achieved the low overheads, low latencies, and high bandwidths that are required for high performance implementations of the above programming models. In particular, cache coherent shared memory, remote reads and writes, and Active Messages offer round-trip communication within a few hundred instruction times, so that frequent communication on a fine granularity (such as on an object by object or cache line basis) remains compatible with high performance. In these settings, the overhead of communication, that is, the time spent by the processor initiating communication, is essentially the cost of pushing message data into the network interface at the sending end and pulling it out at the receiving end. Virtually no cycles are spent in any protocol handling as all reliability and flow control are handled in hardware. The operating system need not be involved in every communication operation because the network interface hardware can enforce protection boundaries across the network.
The above communication architectures cannot be moved in a straightforward manner from multiprocessors to clusters of workstations with ATM networks because of three major differences between the two: ATM networks offer neither reliable delivery nor flow control, ATM network interfaces provide no support for protected user-level access to the network, and the workstation operating systems do not coordinate process scheduling or address translation globally. Coping with these differences poses major technical challenges and may eventually require the integration of some multiprocessor-specific features into the clusters. The following three subsections present the nature of these differences in more detail and discuss the resulting issues.
In contrast, an ATM network does not provide any form of flow control and does not offer reliable delivery. Instead, higher protocol layers must detect cell loss or corruption and cause their retransmission. While this partitioning of responsibilities may be acceptable in the case of stream-based communication (e.g., TCP/IP, video, audio) it is questionable in a parallel computing setting.
The flow control and the error detection and correction in multiprocessor networks serve to cover four causes of message loss: buffer overflow in the receiving software, buffer overflow in the receiving network interface, buffer overflow within the network, and message corruption due to hardware errors. In an ATM network, simple window based end-to-end flow control schemes and a per-message CRC (as used in AAL-5) can cover the first and last cases(5) of cell loss. In addition, preventing buffer overflow in the receiving network interface can be achieved by ensuring that the rate at which cells can be moved from the interface into main memory is at least as large as the maximal cell arrival rate. Preventing buffer overflow within the network, however, is not realistically possible using end-to-end flow control. This is particularly a problem in a parallel computing setting in which all nodes tend to communicate with all other nodes in both highly regular and irregular patterns at unpredictable intervals. The degree of contention within the network therefore cannot be measured or predicted with any accuracy by either the sender or the receiver and communication patterns which result in high contention will result in high cell loss rates causing extensive retransmissions.
Traditional flow control schemes used in stream-based communication avoid fruitless retransmission storms by dynamically reducing the transmission rate on connections which experience high cell loss rates. This works in these settings because, following the law of large numbers, contention in a wide area network does not tend to vary instantaneously and therefore the degree of contention observed in the recent past is a good predictor for contention in the near future.
As an illustration of the difficulties in a parallel computing setting, consider the implementation of a parallel sort. The most efficient parallel sort algorithms  are based on an alternation of local sorts on the nodes and permutation phases in which all nodes exchange data with all other nodes. These permutation phases serve to move the elements to be sorted "towards" their correct position, The communication patterns observed are highly dynamic and their characteristics depend to a large degree on the input data. If at any point the attempted data rate into a given node exceeds the link rate, then the output buffers at up-stream switches will start filling up. Because the communication patterns change very rapidly (essentially with every cell), it is futile to attempt to predict contention, and given the all-to-all communication pattern, the probability of internal contention among seemingly unrelated connections is high.
Beyond the problems caused by contention and the resulting retransmissions, the lack of reliable delivery guarantee in ATM networks imposes a certain overhead on the communication primitives. Specifically, the sender must keep a copy of each cell sent until a corresponding acknowledgment is received, in case the cell must be retransmitted. This means that messages cannot be transferred directly between processor registers and the network interface (as is possible on the CM-5 ), rather, a memory copy must be made as well.
In contrast, the network interfaces available for workstations do not yet incorporate any form of protection mechanism. Instead, the operating system must be involved in the sending and reception of every message. The connection based nature of ATM networks would principally allow the design of a protection mechanism to limit the virtual circuits a user process has access to (the operating system would still control virtual circuit set-up). But because the architecture of the networking layers in current operating systems does not seem to be set-up to allow user-level network interface access, it appears unlikely that network interfaces with these features will become commonplace soon. The challenge in any high-performance communication layer for clusters is, thus, to minimize the path through the kernel by judiciously coordinating the user-kernel interactions.
In shared memory systems this is done by coordinating the address translation tables among all processing nodes such that the originating node can translate the virtual memory address of a remote access and directly place the corresponding physical memory address into the message. The set of communication primitives is small and fixed (e.g., read and write) and by forcing the sender to perform the complicated part of a remote memory access (namely the protection checks and the address translation) the handling of a request is relatively simple to implement(6). If the virtual address were sent, the receiving node could discover that the requested virtual memory location had been paged out to disk with the result that the handling of the message would become rather involved.
In Active Messages on multiprocessors the scheduling of processes is assumed to be coordinated among all nodes such that communicating processes execute simultaneously on their respective nodes. This guarantees that messages can be handled immediately on arrival by the destination process itself. In order to accomplish this, the sender of an Active Message specifies a user-level handler at the destination whose role it is to extract the message from the network and integrate it into the ongoing computation. The handler can also implement a simple remote service and send a reply Active Message back. However, in order to prevent deadlock the communication patterns are limited to requests and replies, e.g., a handler of a reply message is not allowed to send any further messages. An implementation of Active Messages typically reserves the first word of each message for the handler address, and the handler at the receiving end is dispatched immediately on message arrival to dispose of the message. The fact that the message layer can call upon the handlers to deal with messages in FIFO order simplifies the buffering considerably over that required by more traditional message passing models such as PVM, MPI, or NX. These models allow processes to consume messages in arbitrary order and at arbitrary times forcing the communication architecture to implement very general buffer and message matching mechanisms at high cost.
In clusters the fact that the operating systems of the individual nodes are not nearly as coordinated contradicts the assumption that messages can always be consumed quickly upon arrival. In the case of Active Messages the destination process might have been suspended and cannot run the handler, and in a shared memory model the memory location requested might not be mapped. Although exact coordination is not possible without major changes to the operating system core, an implementation of either communication model is likely to be able to perform some coordination among nodes on its own and to influence the local operating system accordingly. This may allow the communication layer to assume that in the common case everything works out fine, but it must be able to handle the difficult cases as well.
These shortcomings will inevitably result in lower communication performance; their quantitative effect on performance is evaluated in the next section which presents a prototype implementation of Active Messages on a cluster of Sun workstations. However, the lack of flow-control in ATM networks poses a fundamental problem: can catastrophic performance degradation occur due to significant cell loss in particular communication patterns?
The basic communication primitive is a message with an associated small amount of computation (in the form of a handler) at the receiving end. Typically the first word of an Active Message points to the handler for that message. On message arrival, the computation on the node is interrupted and the handler is executed. The role of the handler is to get the message out of the network, by integrating it into the ongoing computation and/or by sending a reply message back. The buffering and scheduling provided by Active Messages are extremely primitive and thereby fast: the only buffering is that involved in actual transport and the only scheduling is that required to activate the handler. This is sufficient to support many higher-level abstractions and more general buffering and scheduling can be easily constructed in layers above Active Messages when needed. This minimalist approach avoids paying a performance penalty for unneeded functionality.
In order to prevent deadlock and livelock, Active Message restricts communication patterns to requests and replies, i.e., the handler of a request message is only allowed to send a reply message and a reply handler is not allowed to send further replies.
The C header file for the interface to SSAM is shown in Figure 1. To send a request Active Message, the user places the message data into a per-connection buffer provided by SSAM and calls SSAM_10 with a connection identifier and the remote handler address. SSAM_10 adds the flow-control information and traps to the kernel to have the message injected into the network. It also polls the receiver and processes incoming messages. At the receiving end, the network is polled by SSAM_10 or SSAM_poll (the latter only polls the network) and all messages accumulated in the receive FIFO are moved into a buffer. SSAM then calls the appropriate handler for each message, passing as arguments the originating connection identifier, the address of the buffer holding the message, and the address of a buffer for a reply message. The handler processes the message and may send a reply message back by placing the data in the buffer provided and returning the address of the reply handler (or NULL if no reply is to be sent).
Figure 1: C interface for SPARCstation Active Messages
The current prototype does not use interrupts, instead, the network is polled every time a message is sent. This means that as long as a process is sending messages it will also handle incoming ones. An explicit polling function is provided for program parts which do not communicate. Using interrupts is planned but not implemented yet.
Figure 2: Sample remote read implementation using SSAM
Note that the network interface used is much simpler and closer to multiprocessor NIs than most second-generation ATM interfaces available today. The only function performed in hardware, beyond simply moving cells onto/off the fiber, is checksum generation and checking for the ATM header and an AAL3/4 compatible payload. In particular, no DMA or segmentation and reassembly of multi-cell packets is provided.
The traps to send and receive cells consist of carefully crafted assembly language routines. Each routine is small (28 and 43 instructions for the send and receive traps, respectively) and uses available registers carefully. The register usage is simplified by the Sparc architecture's use of a circular register file, which provides a clean 8-register window for the trap. By interfacing from the program to the traps via a function call, arguments can be passed and another 8 registers become available to the trap.
The following paragraphs describe the critical parts of the SSAM implementation in more detail.
In order to implement the flow control for a window of size w, each process pre-allocates memory to hold 4w cells per every other process with which it communicates. The algorithm to send a request message polls the receiver until a free window slot is available and then injects the cell into the network, saving it in one of the buffers as well in case it has to be retransmitted. Upon receipt of a request message, the message layer moves the cell into a buffer and, as soon as the corresponding process is running, calls the Active Message handler. If the handler issues a reply, it is sent and a copy is held in a buffer. If the handler does not generate a reply, an explicit acknowledgment is sent. Upon receipt of the reply or acknowledgment, the buffer holding the original request message can be reused. Note how the distinction between requests and replies made in Active Messages allows acknowledgments to be piggy-backed onto replies.
The recovery scheme used in case of lost or duplicate cells is standard, except that the reception of duplicate request messages may indicate lost replies which have to be retransmitted. It is important to realize that this flow control mechanism does not really attempt to minimize message losses due to congestion within the network: the lack of flow-control in ATM networks effectively precludes any simple congestion avoidance scheme. Until larger test-beds become available and the ATM community agrees on how routers should handle buffer overflows it seems futile to invest in more sophisticated flow-control mechanisms. Nevertheless, the bursty nature of parallel computing communication patterns are likely to require some solution before the performance characteristics of an ATM network become as robust as those of as multiprocessor networks.
At the receiving end, the read-ATM kernel trap delivers a batch of incoming cells into a pre-determined shared memory buffer. The number of cells received is returned in a register. For each cell the kernel performs four tasks: it loads the first half of the cell into registers, uses the VCI to index into a table to obtain the address of the appropriate processes' input buffer, moves the full cell into that buffer, and checks the integrity of the cell using three flag bits set by the NI in the last byte. Upon return from the trap, the SSAM library loops through all received cells checking the flow-control information, calling the appropriate handlers for request and reply messages, and sending explicit acknowledgments when needed.
Table 1: Cost breakdown for traps to send and receive cells. --------------------------------------------------------- Operation SS-20 SS-1+ --------------------------------------------------------- write trap trap+rett 0.44us 2.03us check pid and con 0.05us 0.49us nection id addt'l kernel ovhd 0.05us 0.50us load cell to push 0.13us 3.87us push cell to NI 2.05us 3.17us total 2.72us 10.11us read trap trap+rett 0.44us 2.03us check cell count 0.81us 1.08us addt'l kernel ovhd 0.18us 0.80us per cell pull from NI 4.27us 3.68us per cell demux 0.09us 0.23us per cell store away 0.17us 3.50us total for 1 cell 5.87us 11.32us per cell total for 16 4.61us 8.08us cells write_read trap total, 0 cells read 3.7us 11.2us total, 1 cell read 8.2us 21.4us null system call 6.9us 40us ---------------------------------------------------------The write trap cost is broken down into 5 parts: the cost of the trap and return, the protection checks, overhead for fetching addresses, loading the cell into registers, and pushing the cell into the network interface. The SS-20 numbers show clearly that the fiber can be saturated by sending a cell at a time from user level. It also indicates that the majority of the cost (75%) lies in the access to the network interface across the Sbus. The cost of the trap itself is surprisingly low, even though it is the second largest item. In fact, it could be reduced slightly as the current implementation adds a level of indirection in the trap dispatch to simplify the dynamic loading of the device driver.(8)
The read trap is itemized similarly: the cost to trap and return, fetching the device register with the count of available cells, additional overhead for setting-up addresses, loading the cell from the network interface, demultiplexing among processes, and storing the cell away. The total cost shows a trap which receives a single cell, as well as the per-cell cost for a trap which receives 16 cells. Here again the access to the device dominates due to the fact that each double-word load incurs the full latency of an Sbus access. The total time of 4.61us on the SS-20 falls short of the fiber's cell time and will limit the achievable bandwidth to at most 68% of the fiber.
The write-read trap first sends a cell and then receives a chunk of cells. This amortizes the cost of the trap across both functions and overlaps checking the cell count slightly with sending. The last item in the table shows the cost of a null system call for comparison purposes (a write to file descriptor -1 was used). It is clear that a system call approach would yield performance far inferior to the traps and would achieve only a fraction of the fiber bandwidth.
Table 2 shows the costs for the various parts of the read and write system calls. The "syscall overhead" entries reflect the time taken for a read (respectively write) system call with an empty read (write) device driver routine. This measures the kernel overhead associated with these system calls. The "check fd, do uiomove" entry reflects the time spent in checking the validity of the file descriptor and performing the uiomove. In the case of a read, it also includes the time to check the device register holding the number of cells available in the input FIFO. The "push/pull cell" entries reflect the time spent to transfer the contents of one cell between the internal buffer and the device FIFOs. The "write" and "read 1 cell" totals reflect the cost of the full system call, while the "read 0 cells" entry is the time taken for an unsuccessful poll which includes the system call overhead, the file descriptor checks, and the reading of the receive-ready register.
Table 2: Cost of sending and receiving cells using read and write system calls. ----------------------------------------------------------- Operation SS-20 SS-1+ ----------------------------------------------------------- write system call syscall overhead 22.6us 100us check fd, do uiomove 3.4us 16us push cell into NI 2.2us 8us write total 28.2us 124us read system call syscall overhead 22.1us 99us pull cell from NI 5.0us 13us check fd and recv ready, 7.0us 25us do uiomove read total for 1 cell 34.1us 137us read total for 0 cells 28.8us 113us -----------------------------------------------------------The timings show clearly that the overhead of the read/write system call interface is prohibitive for small messages. For larger messages, however, it may well be a viable choice and it is more portable than the traps.
Table 3: Cost breakdown for SPARCstation Active Messages. --------------------------------------- Operation SS-20 SS-1+ --------------------------------------- send request 5.0us 15us handle request, no reply 5.6us 15us sent handle request and send 7.7us 25us reply handle ack 5.0us 11us handle reply 5.2us 12us ---------------------------------------The measurements show that supporting only single-cell Active Messages is not optimal. Longer messages are required to achieve peak bulk transfer rates: the one-cell-at-a-time prototype can yield up to 5.6MB/s. A simpler interface for shorter messages (e.g., with only 16 bytes of payload) might well be useful as well to accelerate the small requests and acknowledgments that are often found in higher-level protocols. Unfortunately, given that the trap cost is dominated by the network interface access time and that the SBA-100 requires all 56 bytes of a cell to be transferred by the processor, it is unlikely that a significant benefit can be realized.
The amount of memory allocated by the SSAM prototype is somewhat excessive and, in fact, for simplicity, the current prototype uses twice as many buffers as strictly necessary. For example, assuming that a flow-control window of 32 cells is used, the kernel allocates and pins 8Kbytes of memory per process per connection. On a 64-node cluster with 10 parallel applications running, this represents 5Mb of memory per processor.
The number of preallocated buffers could be reduced without affecting peak bulk transfer rates by adjusting the flow control window size dynamically. The idea is that the first cell of a long message contain a flag which requests a larger window size from the receiver; a few extra buffers would be allocated for this purpose. The receiver grants the larger window to one sender at a time using the first acknowledgment cell of the bulk transfer. The larger window size remains in effect until the end of the long message. This scheme has two benefits: the request for a larger window is overlapped with the first few cells of the long message, and the receiver can prevent too many senders from transferring large data blocks simultaneously, which would be sub-optimal for the cache. However, fundamentally, it appears that memory (or, alternatively, low performance) is the price to pay for having neither flow-control in the network nor coordinated process scheduling.
A more subtle problem having to do with the ATM payload alignment used by the SBA-100 interface will surface in the future: the 53 bytes of an ATM cell are padded by the SBA-100 to 56 bytes and the 48-byte payload starts with the 6th byte, i.e., it is only half-word aligned. The effect is that bulk transfer payload formats designed with the SBA-100 in mind (and supporting double-word moves of data between memory and the SBA-100) will clash with other network interfaces which double-word align the ATM payload.
The time taken by the flow-control and protection in software is surprisingly low (at least in comparison with the network interface access times). The cost, in effect, has been shifted to large pre-allocated and pinned buffers. While the prototype's memory usage is somewhat excessive, other schemes with comparable performance will also require large buffers.
Overall, SSAM's speed comes from a careful integration of all layers, from the language level to the kernel traps. The key issues are avoiding copies by having the application place the data directly where the kernel picks it up to move it into the device and by passing only easy to check information to the kernel (in particular not pass an arbitrary virtual address).
The major difference between the two models is that the remote memory operations separate data and control transfer while Active Messages unifies them. With remote memory accesses data can be transferred to user memory by the kernel without the corresponding process having to run. But the model used does not allow remote reads and writes to the full address space of a process. Rather, each communicating process must allocate special communication memory segments which are pinned by the operating system just as the buffers used by SSAM are. The communication segments are more flexible than SSAM's buffers in that they can directly hold data structures (limited by the fact that the segments are pinned).
The advantage of SSAM over the remote memory accesses is the coupling of data and control: each message causes a small amount of user code to be executed, which allows data to be scattered into complex data structures and the scheduling of computation to be directly influenced by the arrival of data. In the remote memory access model a limited control transfer is offered through per-segment notification flags in order to to cause a file descriptor to become ready.
Finally, SSAM provides a reliable transport mechanism while the remote memory access primitives are unreliable and do not provide flow-control.
Table 4 compares the performance of the two approaches: Thekkath's implementation uses two DECstation 5000 interconnected by a Turbochannel version of the same Fore-100 ATM interface used for SSAM and performs a little worse than SSAM for data transfer and significantly worse for control transfer. The remote reads and writes are directly comparable in that they transfer the same payload per cell.
Table 4: Comparison of SSAM to Remote Memory Accesses between 2 DECstation 5000sover ATM . ------------------------------------ Operation SSAM Remote mem access ------------------------------------ read latency 32us 45us write latency 22us 30us addt'l control none 260us transfer ovhd block write 5.5MB/s 4.4MB/s ------------------------------------The performance of more traditional communication layers over an ATM network has been evaluated by Lin et. al.  and shows over two orders of magnitude higher communication latencies than SSAM offers. Table 5 summarizes the best round-trip latencies and one-way bandwidths attained on Sun 4/690's and SPARCstation 2's connected by Fore SBA-100 interfaces without switch. The millisecond scale reflects the costs of the traditional networking architecture used by these layers, although it is not clear why Fore's AAL/5 API is slower than the read/write system call interface described in 3.4.2. Note that a TCP/IP implementation with a well-optimized fast-path should yield sub-millisecond latencies.
Table 5: Performance of traditional communication layers on Sun4/690s and SPARCstation2s over ATM . ------------------------------------------- Communication layer Round-trip
Peak latency bandwidth ------------------------------------------- Fore AAL/5 API 1.7ms 4MB/s BSD TCP/IP Sockets 3.9ms 2MB/s PVM over TCP/IP 5.4ms 1.5MB/s Sun RPC 3.9ms 1.6MB/s -------------------------------------------
The use of standard components, and in particular, of ATM networking technology, results in three major disadvantages of clusters with respect to multiprocessors: (i) ATM networks do not offer reliable delivery or flow control, (ii) the current network interfaces are not well integrated into the workstation architecture, and (iii) the operating systems on the nodes of a cluster do not coordinate process scheduling or address translations.
The prototype implementation of the Active Messages communication model described in this paper achieves two orders of magnitude better performance than traditional networking layers. Table 6 shows that the resulting communication latencies and bandwidths are in the same ball-park as on state-of-the-art multiprocessors. Key to the success are the use of large memory buffers and the careful design of a lean user-kernel interface. The major obstacle towards closing the remaining performance gap is the slow access to the network interface across the I/O bus, and reducing the buffer memory usage requires coordination of process scheduling across nodes. While taking care of flow control in software does not dominate performance in this study, the behavior of ATM networks under parallel computing communication loads remains an open question.
Table 6: Comparison of SSAM's performance with that of recent parallel machines. ---------------------------------------- Machine Peak
bandwidth latency ---------------------------------------- SP-1 + MPL/p  8.3MB/ 56us s Paragon + NX  73MB/s 44us CM-5 + Active 10MB/s 12us Mesg  SS-20 cluster + 5.6MB/ 32us SSAM s ----------------------------------------