Active Messages: A Mechanism for Integrated
Communication and Computation
Thorsten von Eicken,
David E. Culler, Seth Copen Goldstein, Klaus Erick Schauser, 1992
Original Notes by Yu Zhang, February 19, 1998
Notes revised by Ted Bonkenburg (tb@cs) and presented 8 April 1999
High-level Idea
1. To achieve high performance on large-scale multiprocessor machines, computation and communication must be overlapped.
2. This implies that the communication model should be asynchronous.
3. Active Messages is a communication model that is directly supported by the underlying hardware of parallel machines.
Traditional Send/Receive Model
1. Synchronous model: The send blocks until the corresponding receive is executed; only then is data transferred (3-phase protocol).
- pros: simple, no pre-allocated buffers required at the two ends
- cons: network latency can't be hidden
2. Asynchronous send: The send is non-blocking. The message layer buffers the message until the network port is available, and the message is buffered again at the receiver while it waits to be dispatched.
- pros: overlapping of computation and communication
- cons: large overhead due to buffer management; overlapping is only effective for large messages
Active Message (AM) Model
- Each message contains at its head the address of a
user-level handler
- The handler is executed on message arrival with the
message body as argument. Its role is to get the message out of the network
and into the ongoing computation of the recipient, i.e. either:
- store the arriving data in preallocated data
structure of the recipient, or
- in the case of remote service requests,
immediately reply to the requester.
- The handler must execute quickly and to completion.
(i.e. non-blocking)
Design characteristics:
- Requires a uniform code image on all nodes -- the sender specifies the address of the handler to be invoked at the recipient. This can, of course, be relaxed with one level of indirection.
- No buffering needed (except as required for network transport) -- small messages become attractive!
- Primitive scheduling: handlers interrupt the computation immediately upon message arrival and execute to completion.
- Deadlock avoidance: nodes must continuously accept incoming messages, and reply handlers must not block, so that messages can "retire".
Implementation of AM on Message-passing Architecture
The CM-5 differs from the nCUBE/2 in a number of ways:
- It has two disjoint networks, which solves the
deadlock problem.
- It has user level access to the network interface
and timesharing among user processes.
- The CM-5 limits transfer size to 24 bytes.
- Network does not guarantee packet ordering.
- No DMA - instead, two memory mapped FIFOs.
- Interrupt costs are very high.
This leads to different implementations of AM:

nCUBE/2:
- messages received through an interrupt
- packet ordering comes for free
- additional buffers needed to avoid deadlock
- messages sent with a kernel trap and DMA
- Tc = 30us, Tb = 0.45us

CM-5:
- messages received by polling, then calling the handler
- setup required on the receiving end to recover packet ordering
- deadlock avoided by using one network for requests, the other for replies
- messages sent by stuffing the memory-mapped FIFO queue
- Tc = 23us, Tb = 0.12us
Split-C Programming Model
Provides non-blocking remote-memory operations in C (over AM).
- PUT: copies a local memory block into remote memory at an address specified by the sender.
- GET: retrieves a block of remote memory and makes a local copy.
Message Driven Architectures -vs- AM

Message Driven Architecture:
- Role of handler: arbitrary computation, which can suspend.
- dynamic allocation
- complex scheduling

Active Messages:
- Role of handler: get the message out of the network and into the ongoing computation of the recipient.
- buffering only for network transport
- simple scheduling
TAM ---- Simulating Message Driven Architectures with AM
- activation tree with an embedded scheduling queue
- activation frame --- allocation unit, one per function call
- thread --- execution unit
- inlet --- one per frame; receives a message, stores the data, schedules a thread
- continuation vector --- scheduling queue of enabled threads
- locality of computation is improved by the TAM scheduling hierarchy
H/W Support for AM
1. Network interface: reduce the overhead of composing a message, especially for small messages.
- DMA transfer for large messages (a large message is a small one with a DMA transfer tacked on)
- small messages:
- direct communication between the processor and the network interface
- compose the message in the registers of a network-interface coprocessor
- memory mapping: memory <---> network interface
- use a register set to compose and consume messages --- allows reuse of message data
- support multiple network channels: allow multiple atomic message compositions at the same time
- user-level access to the network interface: hardware protection for sending, and protection for receiving via the MMU
- accelerate frequent message operations
2. Processor support for message handlers:
- multiplex the processor between computation threads and handlers
- traditional approach: interrupts (flush the pipeline, enter the kernel, crawl out to the user handler, trap back to the kernel, return to the interrupted computation)
- fast polling (when message frequency is high)
- user-level interrupts: problems remain
- PC injection: swap between the computation PC and the handler PC --- a minimal form of multithreading
- dual processors --- the key issue is communication between them