Thorsten von Eicken

P U B L I C A T I O N S

on High-Performance Communication

Most documents below are available in several formats: Adobe PDF (click on the title), PostScript, and gzip-compressed PostScript. Many of the documents are "reprints" and carry a copyright notice that allows you to print or copy them for personal use only. Please refer to the notice on each document for details.

P a p e r s

MRPC: A High Performance RPC System for MPMD Parallel Computing. Chi-Chao Chang, Grzegorz Czajkowski, and Thorsten von Eicken. Software Practice and Experience. J. Wiley & Sons, Inc.

MRPC is an RPC system that is designed and optimized for MPMD parallel computing. Existing systems based on standard RPC incur an unnecessarily high cost when used on high-performance multi-computers, limiting the appeal of RPC-based languages in the parallel computing community. MRPC combines the efficient control and data transfer provided by Active Messages (AM) with a minimal multithreaded runtime system that extends AM with the features required to support MPMD. This approach introduces only the necessary runtime RPC overheads for an MPMD environment. MRPC has been integrated into Compositional C++ (CC++), a parallel extension of C++ that offers an MPMD programming model. Basic RPC performance in MRPC is within a factor of two of that of Split-C, a highly tuned SPMD language, and other messaging layers. CC++ applications perform within a factor of two to six of comparable Split-C versions, which represents an order-of-magnitude improvement over previous CC++ implementations.

Evaluating the Performance Limitations of MPMD Communication, Chi-Chao Chang, Grzegorz Czajkowski, Thorsten von Eicken, and Carl Kesselman. In Proceedings of SC '97, San Jose, CA, November 15-21, 1997. Also available as Department of Computer Science, Cornell University, Technical Report TR97-1630, 1997.
The MPMD approach for parallel computing is attractive for programmers who seek fast development cycles, high code re-use, and modular programming, or whose applications exhibit irregular computation loads and communication patterns. RPC is widely adopted as the communication abstraction for crossing address space boundaries. However, the communication overheads of existing RPC-based systems are usually an order of magnitude higher than those found in highly tuned SPMD systems. This problem has thus far limited the appeal of high-level programming languages based on MPMD models in the parallel computing community.

This paper investigates the fundamental limitations of MPMD communication using a case study of two parallel programming languages, Compositional C++ (CC++) and Split-C, that provide support for a global name space. To establish a common comparison basis, our implementation of CC++ was developed to use MRPC, an RPC system optimized for MPMD parallel computing and based on Active Messages. Basic RPC performance in CC++ is within a factor of two of that of Split-C and other messaging layers. CC++ applications perform within a factor of two to six of comparable Split-C versions, which represents an order-of-magnitude improvement over previous CC++ implementations. The results suggest that RPC-based communication can be used effectively in many high-performance MPMD parallel applications.
Performance Implications of Communication Mechanisms in All-Software Global Address Space Systems, Beng-Hong Lim, Chi-Chao Chang, Grzegorz Czajkowski and Thorsten von Eicken. Proceedings of the Sixth ACM Symposium on Principles and Practice of Parallel Programming (PPoPP ’97), Las Vegas, Nevada, June 1997.
Global addressing of shared data simplifies parallel programming and complements message passing models commonly found in distributed memory machines. A number of programming systems have been designed that synthesize global addressing purely in software on such machines. These systems provide a number of communication mechanisms to mitigate the effect of high communication latencies and overheads. This study compares the mechanisms in two representative all-software systems: CRL and Split-C. CRL uses region-based caching while Split-C uses split-phase and push-based data transfers for optimizing communication performance. Both systems take advantage of bulk data transfers.

By implementing a set of parallel applications in both CRL and Split-C, and running them on the IBM SP2, Meiko CS-2 and two simulated architectures, we find that split-phase and push-based bulk data transfers are essential for good performance. Region-based caching benefits applications with irregular structure and with sufficient temporal locality, especially under high communication latencies. However, caching also hurts performance when there is insufficient data reuse or when the size of caching granularity is mismatched with the communication granularity. We find the programming complexity of the communication mechanisms in both languages to be comparable. Based on our results, we recommend that an ideal system intended to support diverse applications on parallel platforms should incorporate the communication mechanisms in CRL and Split-C.
Incorporating Memory Management into User-Level Network Interfaces, Anindya Basu, Matt Welsh, Thorsten von Eicken. Presented at Hot Interconnects V, August 1997, Stanford University.
Also available as Department of Computer Science, Cornell University, Technical Report TR97-1620, February 13th, 1997.
User-level network interfaces allow applications direct access to the network without operating system intervention on every send and receive. Messages are transferred directly to and from user-space by the network interface while observing the traditional protection boundaries between processes. Current user-level network interfaces limit this message transfer to a per-process region of permanently-pinned physical memory to allow safe DMA. This approach is inflexible in that it requires data to be copied into and out of this memory region, and does not scale to a large number of processes.

This paper presents an extension to the U-Net user-level network architecture (U-Net/MM) allowing messages to be transferred directly to and from any part of an application’s address space. This is achieved by integrating a translation look-aside buffer into the network interface and coordinating its operation with the operating system’s virtual memory subsystem. This mechanism allows network buffer pages to be pinned and unpinned dynamically. Two implementations of U-Net/MM are described, demonstrating that existing commodity hardware and commercial operating systems can efficiently support the architecture.
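The mechanism described above, a translation buffer on the network interface that pins pages on demand and unpins them on eviction, can be illustrated with a toy model. This is only a sketch under stated assumptions: the class `NicTlb`, the `pin_page`/`unpin_page` callbacks, the LRU replacement policy, and the 4 KB page size are all hypothetical and are not taken from the U-Net/MM paper.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size, not taken from the paper


class NicTlb:
    """Toy model of a translation buffer on a network interface (hypothetical)."""

    def __init__(self, capacity, pin_page, unpin_page):
        self.capacity = capacity
        self.pin_page = pin_page      # stands in for the OS pinning a page; returns a frame number
        self.unpin_page = unpin_page  # stands in for the OS unpinning a page
        self.entries = OrderedDict()  # virtual page number -> physical frame, in LRU order

    def translate(self, vaddr):
        """Return the physical address for vaddr, pinning its page on a miss."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.entries.move_to_end(vpn)  # hit: mark as most recently used
        else:
            if len(self.entries) >= self.capacity:
                victim, _ = self.entries.popitem(last=False)  # evict least recently used
                self.unpin_page(victim)
            self.entries[vpn] = self.pin_page(vpn)  # miss: pin the page and cache the mapping
        return self.entries[vpn] * PAGE_SIZE + offset
```

In this model only the pages currently held in the buffer stay pinned, so the amount of pinned memory is bounded by the buffer capacity rather than by the size of the application's buffers.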

Design and Performance of Active Messages on the IBM SP-2, Chi-Chao Chang, Grzegorz Czajkowski and Thorsten von Eicken, TR96-1572, February 27, 1996.

This technical report describes the design, implementation, and evaluation of Active Messages on the IBM SP-2. The implementation benchmarked here uses the standard TB2 network adapter firmware but does not use any IBM software on the Power2 processor. We assume familiarity with the concepts underlying Active Messages. The main performance characteristics are a one-word message round-trip time of 51.0 µs and an asymptotic network bandwidth of 34.3 MB/s. After presenting selected implementation details, the paper focuses on detailed performance analysis, including a comparison with IBM's Message Passing Layer (MPL) and Split-C benchmarks.
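To give a feel for what the two headline figures imply, the sketch below plugs them into a simple linear cost model, time = startup + size / bandwidth. The model itself, the choice of half the round-trip time as the one-way startup cost, and the reading of 1 MB as 10^6 bytes are assumptions made for illustration; they are not the analysis used in the report.

```python
ROUND_TRIP_US = 51.0   # one-word round-trip time quoted in the abstract
BANDWIDTH_MB_S = 34.3  # asymptotic bandwidth quoted in the abstract

# Assumption: one-way startup cost is half of the one-word round trip.
STARTUP_US = ROUND_TRIP_US / 2


def transfer_time_us(nbytes, startup_us=STARTUP_US, bandwidth_mb_s=BANDWIDTH_MB_S):
    """Estimated one-way time in microseconds under the linear model."""
    # With 1 MB = 10**6 bytes (assumed), 1 MB/s equals 1 byte per microsecond.
    return startup_us + nbytes / bandwidth_mb_s


def effective_bandwidth_mb_s(nbytes):
    """Bandwidth actually achieved for a message of nbytes under the model."""
    return nbytes / transfer_time_us(nbytes)


def half_power_point_bytes(startup_us=STARTUP_US, bandwidth_mb_s=BANDWIDTH_MB_S):
    """Message size at which half of the asymptotic bandwidth is reached."""
    return startup_us * bandwidth_mb_s


if __name__ == "__main__":
    n_half = half_power_point_bytes()
    print(f"half-power point: {n_half:.0f} bytes")
    print(f"bandwidth at that size: {effective_bandwidth_mb_s(n_half):.2f} MB/s")
```

Under these assumptions, messages of several hundred bytes are still dominated by the startup cost, which is why per-message overhead matters as much as raw bandwidth.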

ATM and Fast Ethernet Network Interfaces for User-level Communication, Matt Welsh, Anindya Basu, and Thorsten von Eicken. Proceedings of the Third International Symposium on High Performance Computer Architecture (HPCA), San Antonio, Texas, February 1-5, 1997.
Fast Ethernet and ATM are two attractive network technologies for interconnecting workstation clusters for parallel and distributed computing. This paper compares network interfaces with and without programmable co-processors for the two types of networks using the U-Net communication architecture to provide low-latency and high-bandwidth communication. U-Net provides protected, user-level access to the network interface and offers application-level round-trip latencies as low as 60 usec over Fast Ethernet and 90 usec over ATM.

The design of the network interface and the underlying network fabric have a large bearing on the U-Net design and performance. Network interfaces with programmable co-processors can transfer data directly to and from user space while others require aid from the operating system kernel. The paper provides detailed performance analysis of U-Net for Fast Ethernet and ATM, including application-level performance on a set of Split-C parallel benchmarks. These results show that high-performance computing is possible on a network of PCs connected via Fast Ethernet.
A Language-Based Approach to Protocol Construction, Anindya Basu, Mark Hayden, Greg Morrisett, and Thorsten von Eicken. Proceedings of the ACM SIGPLAN Workshop on Domain Specific Languages (WDSL), Paris, France, January 1997.
User-level network architectures that provide applications with direct access to network hardware have become popular with the emergence of high-speed networking technologies such as ATM and Fast Ethernet. However, experience with user-level network architectures such as U-Net [vEBBV95] has shown that building correct and efficient protocols on such architectures is a challenge. To address this problem, Promela++, an extension of the Promela protocol validation language, has been developed. Promela++ allows automatic verification of protocol correctness against programmer-specified safety requirements. Furthermore, Promela++ can be compiled to efficient C code. Thus far, the results are encouraging: the C code produced by the Promela++ compiler shows performance comparable to hand-coded versions of a relatively simple protocol.
Low-Latency Communication on the IBM RISC System/6000 SP, Chi-Chao Chang, Grzegorz Czajkowski, Chris Hawblitzel, and Thorsten von Eicken. Proceedings of Supercomputing '96, Pittsburgh, PA, November, 1996.
The IBM SP is one of the most powerful commercial MPPs, yet, in spite of its fast processors and high network bandwidth, the SP's communication latency is inferior to that of older machines such as the TMC CM-5 or Meiko CS-2. This paper investigates the use of Active Messages (AM) communication primitives as an alternative to the standard message passing in order to reduce communication overheads and to offer a good building block for higher layers of software.

The first part of this paper describes an implementation of Active Messages (SP AM) which is layered directly on top of the SP's network adapter (TB2). With comparable bandwidth, SP AM's low overhead yields a round-trip latency that is 40% lower than IBM MPL's. The second part of the paper demonstrates the power of AM as a communication substrate by layering Split-C as well as MPI over it. Split-C benchmarks are used to compare the SP to other MPPs and show that low message overhead and high throughput compensate for SP's high network latency. The MPI implementation is based on the freely available MPICH version and achieves performance equivalent to IBM's MPI-F on the NAS benchmarks.
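For readers unfamiliar with the primitive, the general Active Messages idea is that each message names a handler that the receiver runs on arrival, and a round trip is a request handler that sends a reply. The sketch below illustrates that idea only. The names `Node`, `am_send`, `poll`, and `ping` are hypothetical, the "network" is an in-process queue, and nothing here reflects the SP AM interface or its implementation.

```python
from collections import deque

REQUEST, REPLY = 0, 1  # handler indices, assumed identical on every node


class Node:
    """Toy endpoint holding a handler table and an in-process receive queue."""

    def __init__(self):
        self.handlers = []
        self.inbox = deque()

    def register(self, handler):
        """Add a handler to the table and return its index."""
        self.handlers.append(handler)
        return len(self.handlers) - 1

    def poll(self):
        """Run the named handler for every queued message; return how many ran."""
        count = 0
        while self.inbox:
            index, src, args = self.inbox.popleft()
            self.handlers[index](self, src, *args)
            count += 1
        return count


def am_send(src, dst, handler_index, *args):
    """Deliver a message that names the handler to run at the destination."""
    dst.inbox.append((handler_index, src, args))


def ping(value):
    """One request/reply round trip between two nodes; returns the replies seen."""
    a, b = Node(), Node()
    replies = []

    def on_request(node, src, v):
        am_send(node, src, REPLY, v + 1)  # the request handler replies immediately

    def on_reply(node, src, v):
        replies.append(v)

    for node in (a, b):
        assert node.register(on_request) == REQUEST
        assert node.register(on_reply) == REPLY

    am_send(a, b, REQUEST, value)
    b.poll()  # b runs on_request, which queues the reply for a
    a.poll()  # a runs on_reply
    return replies
```

Calling `ping(41)` sends one request and collects the single reply produced by the request handler.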
LogP: A Practical Model of Parallel Computation, Communications of the ACM, November 1996, Vol. 39, No.11. David E. Culler, Richard M. Karp, David Patterson, Abhijit Sahay, Eunice E. Santos, Klaus Erik Schauser, Ramesh Subramonian and Thorsten von Eicken.

Low-Latency Communication over Fast Ethernet, Matt Welsh, Anindya Basu, Thorsten von Eicken. Proceedings of Euro-Par '96, Lyon, France, August 27-29, 1996.
Fast Ethernet (100Base-TX) can provide a low-cost alternative to more esoteric network technologies for high-performance cluster computing. We use a network architecture based on the U-Net approach to implement low-latency and high-bandwidth communication over Fast Ethernet, with performance rivaling (and in some cases exceeding) that of 155 Mbps ATM. U-Net provides protected, user-level access to the network interface and enables application-level round-trip latencies of less than 60 usec over Fast Ethernet.
U-Net: A User-Level Network Interface for Parallel and Distributed Computing, Anindya Basu, Vineet Buch, Werner Vogels, Thorsten von Eicken. Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP), Copper Mountain, Colorado, December 3-6, 1995.
The U-Net communication architecture provides processes with a virtual view of a network interface to enable user-level access to high-speed communication devices. The architecture, implemented on standard workstations using off-the-shelf ATM communication hardware, removes the kernel from the communication path, while still providing full protection.

The model presented by U-Net allows for the construction of protocols at user level whose performance is only limited by the capabilities of the network. The architecture is extremely flexible in the sense that traditional protocols like TCP and UDP, as well as novel abstractions like Active Messages, can be implemented efficiently. A U-Net prototype on an 8-node ATM cluster of standard workstations offers 65 microseconds round-trip latency and 15 Mbytes/sec bandwidth. It achieves TCP performance at maximum network bandwidth and demonstrates performance equivalent to Meiko CS-2 and TMC CM-5 supercomputers on a set of Split-C benchmarks.

Delivering High-Performance Communication to the Application-Level. Proceedings of the Third IEEE Workshop on the Architecture and Implementation of High Performance Communication Subsystems (HPCS'95), August 1995. Werner Vogels and Thorsten von Eicken.

Low-Latency Communication over ATM Networks using Active Messages, von Eicken, T., V. Avula, A. Basu, V. Buch. Presented at Hot Interconnects II, Aug 1994, Palo Alto, CA. An abridged version of this paper appears in IEEE Micro Magazine, Feb. 1995.
Recent developments in communication architectures for parallel machines have made significant progress and reduced the communication overheads and latencies by over an order of magnitude as compared to earlier proposals. This paper examines whether these techniques can carry over to clusters of workstations connected by an ATM network even though clusters use standard operating system software, are equipped with network interfaces optimized for stream communication, do not allow direct protected user-level access to the network, and use networks without reliable transmission or flow control.

The first part of this paper describes the differences in communication characteristics between clusters of workstations built from standard hardware and software components and state-of-the-art multiprocessors. The lack of flow control and of operating system coordination affects the communication layer design significantly and requires larger buffers at each end than on multiprocessors. The second part evaluates a prototype implementation of the low-latency Active Messages communication model on a Sun workstation cluster interconnected by an ATM network. Measurements show application-to-application latencies of about 20 microseconds for small messages, which is roughly comparable to the Active Messages implementation on the Thinking Machines CM-5 multiprocessor.

November 1998

Department of Computer Science
Cornell University

Thorsten von Eicken
