Thorsten von Eicken
Publications on High-Performance Communication
Most documents below are available in several formats: Adobe PDF (click on the title),
Postscript, and gzip-compressed Postscript. Many of the documents are "reprints" and carry
a copyright notice that allows you to print or copy them for personal use only.
Please refer to the notice on each document for the details.
Papers
- MRPC: A High Performance RPC System for
MPMD Parallel Computing. Chi-Chao Chang, Grzegorz Czajkowski, and Thorsten
von Eicken. Software: Practice and Experience, J. Wiley & Sons, Inc.
MRPC is an RPC system that is designed and optimized for MPMD parallel
computing. Existing systems based on standard RPC incur an unnecessarily high cost when
used on high-performance multi-computers, limiting the appeal of RPC-based languages in
the parallel computing community. MRPC combines the efficient control and data transfer
provided by Active Messages (AM) with a minimal multithreaded runtime system that extends
AM with the features required to support MPMD. This approach introduces only the necessary
runtime RPC overheads for an MPMD environment. MRPC has been integrated into Compositional
C++ (CC++), a parallel extension of C++ that offers an MPMD programming model. Basic RPC
performance in MRPC is within a factor of two of that of Split-C, a highly tuned SPMD
language, and other messaging layers. CC++ applications perform within a factor of two to
six of comparable Split-C versions, an order of magnitude improvement over previous CC++
implementations.
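As a rough illustration of the structure described above (not MRPC's actual interface), an RPC request can be carried in an Active Message whose handler unpacks the arguments, invokes the procedure, and sends the reply as another message. The am_send primitive and the handler signature below are hypothetical stand-ins.

    /* Illustrative sketch only, not the MRPC API: an RPC carried over an
     * Active-Message-style layer. am_send() and the handler signature are
     * hypothetical stand-ins for the substrate's primitives. */
    #include <stdint.h>

    typedef void (*am_handler_t)(int src_node, void *buf, int len);

    /* hypothetical primitive: deliver buf to 'handler' on node 'dst' */
    extern void am_send(int dst, am_handler_t handler, void *buf, int len);

    struct add_req { int32_t call_id; int32_t a, b; };
    struct add_rep { int32_t call_id; int32_t sum; };

    static void add_reply_handler(int src, void *buf, int len) {
        struct add_rep *rep = buf;
        /* match rep->call_id to the outstanding call and wake the waiting caller */
        (void)src; (void)len; (void)rep;
    }

    static void add_request_handler(int src, void *buf, int len) {
        struct add_req *req = buf;                          /* unmarshal arguments */
        struct add_rep rep = { req->call_id, req->a + req->b };
        (void)len;
        am_send(src, add_reply_handler, &rep, sizeof rep);  /* reply message */
    }

The minimal multithreaded runtime mentioned in the abstract would additionally be needed so that request handlers can run procedures that may block, something a bare Active Message handler cannot do.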
- Evaluating the Performance
Limitations of MPMD Communication, Chi-Chao Chang, Grzegorz Czajkowski,
Thorsten von Eicken, and Carl Kesselman. In Proceedings of SC '97, San Jose, CA,
November 15-21, 1997. Also available as Department of Computer Science, Cornell
University, Technical Report TR97-1630, 1997.
 The MPMD approach for parallel computing is attractive for
programmers who seek fast development cycles, high code re-use, and modular programming,
or whose applications exhibit irregular computation loads and communication patterns. RPC
is widely adopted as the communication abstraction for crossing address space boundaries.
However, the communication overheads of existing RPC-based systems are usually an order of
magnitude higher than those found in highly tuned SPMD systems. This problem has thus far
limited the appeal of high-level programming languages based on MPMD models in the
parallel computing community.
This paper investigates the fundamental limitations of MPMD communication using a case
study of two parallel programming languages, Compositional C++ (CC++) and Split-C, that
provide support for a global name space. To establish a common comparison basis, our
implementation of CC++ was developed to use MRPC, an RPC system optimized for MPMD parallel
computing and based on Active Messages. Basic RPC performance in CC++ is within a factor
of two of that of Split-C and other messaging layers. CC++ applications perform within
a factor of two to six of comparable Split-C versions, an order of magnitude improvement
over previous CC++ implementations. The results suggest that
RPC-based communication can be used effectively in many high-performance MPMD parallel
applications.
- Performance Implications of
Communication Mechanisms in All-Software Global Address Space Systems,
Beng-Hong Lim, Chi-Chao Chang, Grzegorz Czajkowski and Thorsten von Eicken. Proceedings of
the Sixth ACM Symposium on Principles and Practice of Parallel Programming (PPoPP
’97), Las Vegas, Nevada, June 1997.
 Global addressing of shared data simplifies parallel programming and
complements message passing models commonly found in distributed memory machines. A
number of programming systems have been designed that synthesize global addressing purely
in software on such machines. These systems provide a number of communication mechanisms
to mitigate the effect of high communication latencies and overheads. This study compares
the mechanisms in two representative all-software systems: CRL and Split-C. CRL uses
region-based caching while Split-C uses split-phase and push-based data transfers for
optimizing communication performance. Both systems take advantage of bulk data transfers.
By implementing a set of parallel applications in both
CRL and Split-C, and running them on the IBM SP2, Meiko CS-2 and two simulated
architectures, we find that split-phase and push-based bulk data transfers are essential
for good performance. Region-based caching benefits applications with irregular structure
and with sufficient temporal locality, especially under high communication latencies.
However, caching also hurts performance when there is insufficient data reuse or when the
size of caching granularity is mismatched with the communication granularity. We find the
programming complexity of the communication mechanisms in both languages to be comparable.
Based on our results, we recommend that an ideal system intended to support diverse
applications on parallel platforms should incorporate the communication mechanisms in CRL
and Split-C.
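As a small illustration of the split-phase mechanism discussed above, written in plain C with hypothetical primitives rather than actual Split-C syntax: remote reads are issued without blocking, independent local work overlaps them, and a single synchronization waits for all outstanding transfers.

    /* Split-phase sketch; split_get() and sync_all() are hypothetical
     * stand-ins for Split-C's non-blocking global reads and sync. */
    extern void split_get(void *local_dst, int node, const void *remote_src, int nbytes);
    extern void sync_all(void);     /* wait for all outstanding gets */

    void gather_rows(double *dst, int peer, const double *remote_rows, int n) {
        for (int i = 0; i < n; i++)                     /* start n transfers */
            split_get(&dst[i], peer, &remote_rows[i], sizeof(double));
        /* ...independent local computation can overlap the transfers... */
        sync_all();                                     /* results now usable */
    }

Push-based transfers invert this pattern: the owner of the data writes it into the consumer's memory without the consumer first requesting it, which is the other mechanism the study credits for good performance.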
- Incorporating Memory
Management into User-Level Network Interfaces, Anindya Basu, Matt Welsh,
Thorsten von Eicken. Presented at Hot Interconnects V, August 1997, Stanford University.

Also available as Department of Computer Science, Cornell University, Technical Report
TR97-1620, February 13th, 1997.
User-level network interfaces allow applications direct access
to the network without operating system intervention on every send and receive. Messages
are transferred directly to and from user-space by the network interface while observing
the traditional protection boundaries between processes. Current user-level network
interfaces limit this message transfer to a per-process region of permanently-pinned
physical memory to allow safe DMA. This approach is inflexible in that it requires data to
be copied into and out of this memory region, and does not scale to a large number of
processes.
This paper presents an extension to the U-Net
user-level network architecture (U-Net/MM) allowing messages to be transferred directly to
and from any part of an application’s address space. This is achieved by integrating
a translation look-aside buffer into the network interface and coordinating its operation
with the operating system’s virtual memory subsystem. This mechanism allows network
buffer pages to be pinned and unpinned dynamically. Two implementations of U-Net/MM are
described, demonstrating that existing commodity hardware and commercial operating systems
can efficiently support the architecture.
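A rough sketch of the translation step the paper describes: before the interface DMAs a message buffer, it looks the virtual page up in a small TLB and, on a miss, asks the host operating system to pin the page and install a mapping. The structure layout and the os_pin_page upcall below are illustrative assumptions, not the U-Net/MM implementation.

    /* Illustrative NI-side translation; names and layout are assumptions. */
    #include <stdint.h>

    #define NI_TLB_ENTRIES 256
    #define PAGE_SHIFT     12

    struct ni_tlb_entry {
        uint32_t  pid;       /* owning process */
        uintptr_t vpage;     /* virtual page number */
        uintptr_t ppage;     /* physical page number, pinned while mapped */
        int       valid;
    };

    static struct ni_tlb_entry ni_tlb[NI_TLB_ENTRIES];

    /* hypothetical upcall: the OS pins the page and returns its frame number */
    extern uintptr_t os_pin_page(uint32_t pid, uintptr_t vpage);

    uintptr_t ni_translate(uint32_t pid, uintptr_t vaddr) {
        uintptr_t vpage = vaddr >> PAGE_SHIFT;
        struct ni_tlb_entry *e = &ni_tlb[vpage % NI_TLB_ENTRIES];
        if (!e->valid || e->pid != pid || e->vpage != vpage) {
            /* miss: evict this slot and consult the OS; a real implementation
             * would also unpin the page held by the evicted entry */
            e->pid = pid;
            e->vpage = vpage;
            e->ppage = os_pin_page(pid, vpage);
            e->valid = 1;
        }
        return (e->ppage << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
    }

Dynamic pinning and unpinning then amounts to replacing TLB entries in coordination with the host's virtual memory system, rather than dedicating a fixed, permanently pinned region to each process.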
This technical report describes the design,
implementation, and evaluation of Active Messages on the IBM SP-2. The implementation
benchmarked here uses the standard TB2 network adapter firmware but does not use any IBM
software on the Power2 processor. We assume familiarity with the concepts underlying
Active Messages. The main performance characteristics are a one-word message round-trip
time of 51.0 µs and an asymptotic network bandwidth of 34.3 MB/s. After presenting
selected implementation details, the paper focuses on detailed performance analysis,
including a comparison with IBM's Message Passing Layer (MPL) and Split-C benchmarks.
- ATM and Fast Ethernet
Network Interfaces for User-level Communication, Matt Welsh, Anindya Basu, and
Thorsten von Eicken. Proceedings of the Third International Symposium on High Performance
Computer Architecture (HPCA), San Antonio, Texas, February 1-5, 1997.
Fast Ethernet and ATM are two attractive network technologies for
interconnecting workstation clusters for parallel and distributed computing. This paper
compares network interfaces with and without programmable co-processors for the two types
of networks using the U-Net communication architecture to provide low-latency and
high-bandwidth communication. U-Net provides protected, user-level access to the network
interface and offers application-level round-trip latencies as low as 60 usec over Fast
Ethernet and 90 usec over ATM.
The design of the network interface and the underlying
network fabric have a large bearing on the U-Net design and performance. Network
interfaces with programmable co-processors can transfer data directly to and from user
space while others require aid from the operating system kernel. The paper provides
detailed performance analysis of U-Net for Fast Ethernet and ATM, including
application-level performance on a set of Split-C parallel benchmarks. These results show
that high-performance computing is possible on a network of PCs connected via Fast
Ethernet.
- A Language-Based Approach to
Protocol Construction, Anindya Basu, Mark Hayden, Greg Morrisett, and
Thorsten von Eicken. Proceedings of the ACM SIGPLAN Workshop on Domain Specific Languages
(WDSL), Paris, France, January 1997.
 User-level network architectures that provide applications with direct
access to network hardware have become popular with the emergence of high-speed networking
technologies such as ATM and Fast Ethernet. However, experience with user-level network
architectures such as U-Net [vEBBV95] has shown that building correct and efficient
protocols on such architectures is a challenge. To address this problem, Promela++, an
extension of the Promela protocol validation language, has been developed. Promela++ allows
automatic verification of protocol correctness against programmer-specified safety
requirements. Furthermore, Promela++ can be compiled to efficient C code. Thus far, the
results are encouraging: the C code produced by the Promela++ compiler shows performance
comparable to hand-coded versions of a relatively simple protocol.
- Low-Latency Communication on the IBM
RISC System/6000 SP, Chi-Chao Chang, Grzegorz Czajkowski, Chris Hawblitzel,
and Thorsten von Eicken. Proceedings of Supercomputing '96, Pittsburgh, PA, November,
1996.
 The IBM SP is one of the most powerful commercial MPPs, yet, in
spite of its fast processors and high network bandwidth, the SP's communication latency is
inferior to that of older machines such as the TMC CM-5 or Meiko CS-2. This paper investigates the
use of Active Messages (AM) communication primitives as an alternative to the standard
message passing in order to reduce communication overheads and to offer a good building
block for higher layers of software.
The first part of this paper describes an
implementation of Active Messages (SP AM) which is layered directly on top of the SP's
network adapter (TB2). With comparable bandwidth, SP AM's low overhead yields a round-trip
latency that is 40% lower than IBM MPL's. The second part of the paper demonstrates the
power of AM as a communication substrate by layering Split-C as well as MPI over it.
Split-C benchmarks are used to compare the SP to other MPPs and show that low message
overhead and high throughput compensate for the SP's high network latency. The MPI
implementation is based on the freely available MPICH version and achieves performance
equivalent to IBM's MPI-F on the NAS benchmarks.
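To illustrate how a message-passing layer can be built on an Active-Message substrate like the one described above (the am_send and am_poll names are hypothetical, not the SP AM interface): a sender's message triggers a handler at the receiver that deposits the payload and sets a flag that the receiving side waits on.

    /* Sketch of a blocking send/recv layered over Active-Message handlers.
     * am_send()/am_poll() are hypothetical stand-ins; a real layer such as
     * the MPI port mentioned above also needs matching, queues, and flow control. */
    #include <string.h>

    extern void am_send(int dst, void (*h)(int, void *, int), void *buf, int len);
    extern void am_poll(void);                 /* drain incoming messages */

    static volatile int msg_ready;
    static char         msg_buf[4096];
    static int          msg_len;

    static void deposit_handler(int src, void *buf, int len) {
        (void)src;
        memcpy(msg_buf, buf, len);             /* deposit payload at the receiver */
        msg_len = len;
        msg_ready = 1;
    }

    void blocking_send(int dst, void *buf, int len) {
        am_send(dst, deposit_handler, buf, len);
    }

    int blocking_recv(void *buf, int maxlen) {
        while (!msg_ready)
            am_poll();                         /* handlers run while we wait */
        int n = msg_len < maxlen ? msg_len : maxlen;
        memcpy(buf, msg_buf, n);
        msg_ready = 0;
        return n;
    }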
- LogP: A Practical Model of Parallel Computation, David E. Culler, Richard M.
Karp, David Patterson, Abhijit Sahay, Eunice E. Santos, Klaus Erik Schauser, Ramesh
Subramonian, and Thorsten von Eicken. Communications of the ACM, Vol. 39, No. 11,
November 1996.
- Low-Latency
Communication over Fast Ethernet, Matt Welsh, Anindya Basu, Thorsten von Eicken.
Proceedings of Euro-Par '96, Lyon, France, August 27-29, 1996.
Fast Ethernet (100Base-TX) can provide a low-cost alternative to more
esoteric network technologies for high-performance cluster computing. We use a network
architecture based on the U-Net approach to implement low-latency and high-bandwidth
communication over Fast Ethernet, with performance rivaling (and in some cases exceeding)
that of 155 Mbps ATM. U-Net provides protected, user-level access to the network interface
and enables application-level round-trip latencies of less than 60 usec over Fast
Ethernet.
- U-Net: A User-Level
Network Interface for Parallel and Distributed Computing, Anindya Basu, Vineet
Buch, Werner Vogels, Thorsten von Eicken. Proceedings of the 15th ACM Symposium on
Operating Systems Principles (SOSP), Copper Mountain, Colorado, December 3-6, 1995.
 The U-Net communication architecture provides processes with a virtual
view of a network interface to enable user-level access to high-speed communication
devices. The architecture, implemented on standard workstations using off-the-shelf ATM
communication hardware, removes the kernel from the communication path, while still
providing full protection.
The model presented by U-Net allows for the
construction of protocols at user level whose performance is limited only by the
capabilities of the network. The architecture is extremely flexible in the sense that
traditional protocols like TCP and UDP, as well as novel abstractions like Active Messages
can be implemented efficiently. A U-Net prototype on an 8-node ATM cluster of standard
workstations offers 65 microseconds round-trip latency and 15 Mbytes/sec bandwidth. It
achieves TCP performance at maximum network bandwidth and demonstrates performance
equivalent to Meiko CS-2 and TMC CM-5 supercomputers on a set of Split-C benchmarks.
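A simplified picture of the endpoint abstraction described above: each process owns a communication segment of message buffers together with send, receive, and free queues that the network interface services directly, so no kernel call is needed per message. The field names and sizes below are illustrative, not U-Net's actual layout.

    /* Sketch of a U-Net-style endpoint; all names and sizes are assumptions. */
    #include <stdint.h>

    #define UNET_QLEN  64
    #define UNET_BUFSZ 2048

    struct unet_desc {                 /* one queued message */
        uint32_t buf_offset;           /* payload location within the segment */
        uint16_t length;
        uint16_t channel;              /* communication channel tag */
    };

    struct unet_endpoint {
        uint8_t  segment[UNET_QLEN * UNET_BUFSZ];   /* pinned message buffers */
        struct unet_desc send_queue[UNET_QLEN];     /* app enqueues, NI consumes */
        struct unet_desc recv_queue[UNET_QLEN];     /* NI enqueues, app consumes */
        uint32_t free_queue[UNET_QLEN];             /* buffers the NI may receive into */
        volatile uint32_t send_head, send_tail;
        volatile uint32_t recv_head, recv_tail;
        volatile uint32_t free_head, free_tail;
    };

Protection follows from mapping each endpoint only into its owning process, so the queues can be manipulated from user space while the interface enforces process boundaries.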
- Delivering High-Performance Communication to the
Application-Level. Proceedings of the Third IEEE Workshop on the Architecture and
Implementation of High Performance Communication Subsystems (HPCS'95), August 1995. Werner
Vogels and Thorsten von Eicken.
- Low-Latency
Communication over ATM Networks using Active Messages, T. von Eicken, V. Avula,
A. Basu, and V. Buch. Presented at Hot Interconnects II, August 1994, Palo Alto, CA. An
abridged version of this paper appears in IEEE Micro Magazine, Feb. 1995.
Recent developments in
communication architectures for parallel machines have reduced communication overheads
and latencies by over an order of magnitude compared to earlier proposals. This paper
examines whether these techniques can carry over
to clusters of workstations connected by an ATM network even though clusters use standard
operating system software, are equipped with network interfaces optimized for stream
communication, do not allow direct protected user-level access to the network, and use
networks without reliable transmission or flow control.
In a first part, this paper describes the differences
in communication characteristics between clusters of workstations built from standard
hardware and software components and state-of-the-art multiprocessors. The lack of flow
control and of operating system coordination affects the communication layer design
significantly and requires larger buffers at each end than on multiprocessors. A second
part evaluates a prototype implementation of the low-latency Active Messages communication
model on a Sun workstation cluster interconnected by an ATM network. Measurements show
application-to-application latencies of about 20 microseconds for small messages, which is
roughly comparable to the Active Messages implementation on the Thinking Machines CM-5
multiprocessor.