Active Messages: A Mechanism for Integrated
Communication and Computation
Thorsten von Eicken,
David E. Culler, Seth Copen Goldstein, Klaus Erick Schauser, 1992
Original Notes by Yu Zhang, February 19, 1998
Notes revised by Ted Bonkenburg (tb@cs) and presented 8 April 1999
High-level Idea
1. To achieve high performance on large-scale multiprocessor machines, computation and communication must be overlapped.
2. This implies that the communication model should be asynchronous.
3. Active Messages is a communication model that is directly supported by the underlying hardware of parallel machines.
Traditional Send/Receive Model
1. Synchronous model: The send blocks until the corresponding receive is executed; only then is data transferred (3-phase protocol).
- pros: simple, no pre-allocated buffers required at the two ends
- cons: network latency can't be hidden
2. Asynchronous send: The send is non-blocking. The message layer buffers the message until the network port is available, and the message is buffered again at the receiver while it waits to be dispatched.
- pros: overlapping of computation and communication
- cons: large overhead due to buffer management; overlapping is only effective for large messages
Active Message (AM) Model
- Each message contains at its head the address of a
user-level handler
- The handler is executed on message arrival with the
message body as argument. Its role is to get the message out of the network
and into the ongoing computation of the recipient, i.e. either:
- store the arriving data in preallocated data
structure of the recipient, or
- in the case of remote service requests,
immediately reply to the requester.
- The handler must execute quickly and to completion.
(i.e. non-blocking)
Design characteristics:
- Requires a uniform code image on all nodes -- the sender specifies the address of the handler to be invoked at the recipient. This can, of course, be relaxed with one level of indirection.
- No buffering needed (except as required for network transport) -- small messages become attractive!
- Primitive scheduling: handlers interrupt the computation immediately upon message arrival and execute to completion.
- Deadlock avoidance: nodes must continuously accept incoming messages, and reply handlers must not block, so that messages can "retire".
Implementation of AM on Message-passing Architecture
The CM-5 differs from the nCUBE/2 in a number of ways:
- It has two disjoint networks, which solves the
deadlock problem.
- It has user level access to the network interface
and timesharing among user processes.
- The CM-5 limits transfer size to 24 bytes.
- Network does not guarantee packet ordering.
- No DMA - instead, two memory mapped FIFOs.
- Interrupt costs are very high.
This leads to different implementations of AM:

nCUBE/2:
- messages received through an interrupt
- packet ordering comes for free
- additional buffers needed to avoid deadlock
- messages sent with a kernel trap and DMA
- Tc = 30us, Tb = 0.45us

CM-5:
- messages received by polling, then calling the handler
- setup required on the receiving end to recover packet ordering
- deadlock avoided by using one network for requests, the other for replies
- messages sent by stuffing the memory-mapped FIFO queue
- Tc = 23us, Tb = 0.12us
Split-C Programming Model
Provides non-blocking remote-memory operations in C (over AM).
- PUT: copies a local memory block into remote memory at an address specified by the sender.
- GET: retrieves a block of remote memory and makes a local copy.
Message Driven Architectures -vs- AM

Message Driven Architecture:
- Role of handler: arbitrary computation, which can suspend.
- dynamic allocation
- complex scheduling

Active Messages:
- Role of handler: get the message out of the network and into the ongoing computation of the recipient.
- buffering only for network transport
- simple scheduling
TAM ---- Simulating Message Driven Architectures with AM
- activation tree with an embedded scheduling queue
- activation frame --- allocation unit, one per function call
- thread --- execution unit
- inlet --- one per frame; receives a message, stores the data, schedules a thread
- continuation vector --- scheduling queue of enabled threads
- locality of computation is improved by the TAM scheduling hierarchy
H/W Support for AM
1. Network interface: reduce the overhead of composing a message, especially for small messages.
- DMA transfer for large messages (a large message is a small one with a DMA transfer tacked on)
- small messages:
- direct communication between the processor and the network interface
- compose the message in the registers of a network-interface coprocessor
- memory mapping: memory <---> network interface
- use a register set to compose and consume messages --- allows reuse of message data
- support multiple network channels: allow multiple atomic message compositions at the same time
- user-level access to the network interface: hardware protection for sending, and protection for receiving via the MMU
- accelerate frequent message operations
2. Processor support for message handlers:
- multiplex the processor between computation threads and handlers
- traditional approach: interrupts (flush the pipeline, enter the kernel, crawl out to the user handler, trap back to the kernel, return to the interrupted computation)
- fast polling (when message frequency is high)
- user-level interrupts: problems remain
- PC injection: swap between the computation PC and the handler PC --- a minimal form of multithreading
- dual processors --- the key issue is communication between them