Active Messages: a Mechanism for Integrated Communication and Computation
Notes by Yu Zhang, February 19, 1998
High-Level Idea
1. Algorithmic communication model => essential properties of the communication mechanism:
- communication overhead must be low
   - overlap and coordination of communication with ongoing computation
2. Find a communication model compatible with h/w functionality.
Traditional Send/Receive Model
1. Synchronous model
   The send blocks until the corresponding receive is executed; only then
   is the data transferred (3-phase protocol)
pros: simple, no buffer required at the two ends
cons: network latency can't be hidden
2. Asynchronous sending
   The send is non-blocking. The message layer buffers the message until
   the network port is available. The message is buffered at the receiver
   until it is dispatched.
pros: overlapping of computation and communication
   cons: large overhead due to buffer management. Overlapping is only
   effective for large messages.
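The two semantics side by side, sketched with the MPI C API (not part of the paper, but the closest standard interface; the tag values and types here are arbitrary):

    #include <mpi.h>

    /* Contrast of the two traditional send semantics. */
    void demo_sends(int peer, int *data, int n)
    {
        MPI_Request req;

        /* Synchronous send: completes only once the matching receive has
         * started (the 3-phase handshake). No buffering needed, but the
         * network latency is fully exposed to the sender. */
        MPI_Ssend(data, n, MPI_INT, peer, 0, MPI_COMM_WORLD);

        /* Asynchronous send: returns immediately; the message layer may
         * buffer the data, so computation can overlap the transfer. */
        MPI_Isend(data, n, MPI_INT, peer, 1, MPI_COMM_WORLD, &req);
        /* ... overlapped computation goes here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* buffer-management cost surfaces here */
    }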
Active Message (AM) Model
- Each message contains at its head the address of a user-level handler
- The handler is executed on message arrival with the message body as argument. Its role
is to get the message out of the network and into the ongoing computation of the
recipient, i.e. either
   - store the arriving data in a preallocated data structure of the recipient, or
   - in the case of a remote service request, immediately reply to the requester.
- The handler must execute quickly and run to completion.
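A toy sketch of the mechanism in C (all names here are invented): the head of the message is the address of a user-level handler, valid on every node because the code image is uniform; delivery means invoking that handler on the message body.

    #include <stdio.h>

    typedef void (*am_handler_t)(void *node_state, int arg0, int arg1);

    typedef struct {
        am_handler_t handler;    /* message head: user-level handler address */
        int          arg0, arg1; /* message body: handler arguments */
    } active_msg_t;

    /* Example handler: deposit arriving data into a preallocated slot,
     * then finish -- quickly, to completion, without blocking. */
    static void store_handler(void *node_state, int index, int value)
    {
        ((int *)node_state)[index] = value;
    }

    /* What delivery means: on arrival, invoke the named handler on the
     * body, pulling the message straight out of the network. */
    static void am_deliver(void *node_state, active_msg_t *m)
    {
        m->handler(node_state, m->arg0, m->arg1);
    }

    int main(void)
    {
        int table[8] = {0};                      /* preallocated structure */
        active_msg_t m = { store_handler, 3, 42 };
        am_deliver(table, &m);                   /* simulate arrival */
        printf("table[3] = %d\n", table[3]);     /* prints 42 */
        return 0;
    }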
design characteristics:
- Require a uniform code image on all nodes -- the sender specifies the address of the
handler to be invoked at the recipient.
- No need for buffering (except as required for network transport) -- small messages become
  attractive!
- Primitive scheduling: the handlers interrupt the computation immediately upon message
  arrival and execute to completion.
- Deadlock avoidance: a handler cannot block.
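A sketch of the remote-service-request case under these rules (the am_send primitive below is a local stand-in that delivers immediately): the request handler never blocks; it composes a reply at once, and the reply handler deposits the result and raises a flag.

    #include <stdio.h>
    #include <stdint.h>

    typedef struct { int value; volatile int *done; } reply_ctx_t;

    /* Local stand-in for the network: deliver the message in place. */
    static void am_send(int dst, void (*h)(void *, void *), void *a0, void *a1)
    {
        (void)dst;
        h(a0, a1);
    }

    static int shared_counter = 99;   /* per-node state being read remotely */

    static void reply_handler(void *ctx, void *val)
    {
        reply_ctx_t *r = ctx;
        r->value = (int)(intptr_t)val;  /* store into a preallocated slot */
        *r->done = 1;                   /* signal the waiting computation */
    }

    static void request_handler(void *requester, void *ctx)
    {
        /* Reply immediately from within the handler -- no blocking. */
        am_send((int)(intptr_t)requester, reply_handler, ctx,
                (void *)(intptr_t)shared_counter);
    }

    int main(void)
    {
        volatile int done = 0;
        reply_ctx_t ctx = { 0, &done };
        am_send(0, request_handler, (void *)0, &ctx);  /* issue the request */
        while (!done) ;                 /* requester spins on completion flag */
        printf("remote value: %d\n", ctx.value);
        return 0;
    }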
Q: How generally can the requirement of a uniform code image be satisfied?
Q: As to deadlock avoidance, is the condition necessary and sufficient?
Implementation of AM on Message-Passing Architectures
Differences in network and network interface support between the nCUBE/2 and the CM-5, in terms of
- number of disjoint networks
- user-level access to the network interface
- packet size
- packet ordering preservation
- memory-mapped FIFOs
- DMA
dictate different implementations of AM:
- trap to kernel when sending vs. stuff the memory-mapped outgoing FIFO
- interrupts vs. polling for message reception
- packet ordering for free vs. additional set-up on the receiving end or packet header
  information to preserve packet ordering
- additional buffers for deadlock avoidance vs. one-way communication on two disjoint
networks
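A sketch of the first contrast (the addresses, register layout, and status bit are invented; this compiles but targets hypothetical hardware): a CM-5-style user-level send stuffs the memory-mapped outgoing FIFO and retries on failure, with no kernel crossing, whereas the nCUBE/2 path for the same send is a trap into the kernel.

    #include <stdint.h>

    /* Hypothetical memory-mapped network interface registers. */
    #define NI_SEND      ((volatile uint32_t *)0x50000000u)  /* outgoing FIFO */
    #define NI_STATUS    ((volatile uint32_t *)0x50000010u)
    #define NI_SEND_OK   0x1u

    /* User-level send: stuff the FIFO directly, no trap to the kernel. */
    static void cm5_style_send(uint32_t dest, const uint32_t *pkt, int words)
    {
        do {
            *NI_SEND = dest;                  /* route/header word */
            for (int i = 0; i < words; i++)
                *NI_SEND = pkt[i];            /* stuff payload into the FIFO */
        } while (!(*NI_STATUS & NI_SEND_OK)); /* NI may refuse; retry whole pkt */
    }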
Q: From the instruction breakdown of the AM implementation on the nCUBE/2, we see that the
   AM-specific crawl-out accounts for a substantial part of the overhead. Can you figure
   out how to cut down the crawl-out overhead or eliminate it?
Split-C Programming Model
provides split-phase, non-blocking remote memory operations in C.
- PUT
- GET
- completion flag for synchronization
programming pattern:
- Receiver-initiated data transmission via the GET operation
- To overlap communication and computation, the receiver must initiate the data transmission
  early enough before using the data (see the sketch below)
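A sketch of that double-buffered GET pattern (the GET signature is paraphrased, not the exact Split-C primitive, and its body below is a local stand-in; a real split-phase GET issues an active message and returns before the data arrives):

    #include <string.h>

    /* Stand-in for split-phase GET: fetch nbytes from `src` on node
     * `node` into `dst`, then raise the completion flag. */
    static void GET(int node, const void *src, void *dst, int nbytes,
                    volatile int *done)
    {
        (void)node;
        memcpy(dst, src, nbytes);
        *done = 1;
    }

    #define BLK 256

    /* Receiver-initiated, double-buffered consumption of remote data. */
    void consume(int peer, const double *remote, int nblocks)
    {
        static double buf[2][BLK];
        volatile int done[2] = {0, 0};

        GET(peer, remote, buf[0], sizeof buf[0], &done[0]);  /* prefetch */
        for (int i = 0; i < nblocks; i++) {
            int cur = i & 1;
            if (i + 1 < nblocks)        /* initiate the next transfer early */
                GET(peer, remote + (i + 1) * BLK, buf[cur ^ 1],
                    sizeof buf[0], &done[cur ^ 1]);
            while (!done[cur]) ;        /* spin on the completion flag */
            /* ... compute on buf[cur] ... */
            done[cur] = 0;              /* reset the flag for reuse */
        }
    }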
Q: What are the relations and differences among AM, shared memory, and memory-mapped
   communication, in terms of mechanism and programmer responsibility?
Using AM to support languages with dynamic parallelism (countering the case for
message-driven architectures)
1. Message-Driven Architectures
- message: handler + data
- Messages go into a scheduling queue on arrival.
- Once a message reaches the head of the queue, the handler is executed with the data as
arguments
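A toy sketch of message-driven dispatch in C (names invented): arrival only enqueues; the handler, which in the real model may be an arbitrary suspendable computation, runs when the message reaches the head of the queue.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct msg {
        void (*handler)(struct msg *);  /* may suspend in the real model */
        int data;
        struct msg *next;
    } msg_t;

    static msg_t *head, *tail;

    static void arrive(void (*h)(msg_t *), int data)  /* network side */
    {
        msg_t *m = malloc(sizeof *m);   /* queue size/lifetime unbounded */
        m->handler = h; m->data = data; m->next = NULL;
        if (tail) tail->next = m; else head = m;
        tail = m;
    }

    static void print_handler(msg_t *m) { printf("got %d\n", m->data); }

    int main(void)
    {
        arrive(print_handler, 1);
        arrive(print_handler, 2);
        while (head) {                  /* scheduler: run from queue head */
            msg_t *m = head;
            head = m->next;
            if (!head) tail = NULL;
            m->handler(m);
            free(m);
        }
        return 0;
    }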
Message-Driven Architectures vs. Active Messages
- Message-driven role of handler: arbitrary computation, which can suspend. Thus:
  - dynamic allocation: size and lifetime of the scheduling queue are arbitrary
  - complex scheduling
- Active Message role of handler: get the message out of the network and into
  the ongoing computation of the recipient. Thus:
  - buffering only for network transport
  - simple scheduling
Q: Why is h/w support for message-driven architectures counterproductive?
   The message-by-message scheduling inherent in the model results in short computation
   run-lengths. This lack of locality prevents the utilization of large register sets.
2. TAM ---- Simulating Message-Driven Architectures with AM
   TAM (Threaded Abstract Machine), a fine-grain parallel execution model based on AM.
   Compiler-assisted allocation and scheduling:
   - The compiler allocates an activation frame per function invocation; message handlers
     (inlets) deposit arriving data directly into preallocated frame slots.
   - Enabled threads belonging to the same frame run back-to-back, recovering the locality
     that message-by-message scheduling destroys.
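A minimal sketch of a TAM-style inlet expressed as an AM handler (names and frame layout invented): the inlet stores the arriving value into a frame slot and decrements a synchronization counter; when the counter reaches zero, the dependent thread becomes runnable.

    #include <stdio.h>

    typedef struct frame {
        double slot[4];      /* compiler-allocated operand slots */
        int    sync;         /* inputs still outstanding */
        int    ready_thread; /* thread enabled once sync reaches 0 */
    } frame_t;

    /* Inlet: runs on message arrival, to completion, without blocking. */
    static void inlet(frame_t *f, int slot, double value, int thread)
    {
        f->slot[slot] = value;
        if (--f->sync == 0)
            f->ready_thread = thread;  /* post thread; scheduler runs it later */
    }

    int main(void)
    {
        frame_t f = { {0}, 2, -1 };
        inlet(&f, 0, 1.5, 7);          /* first input: thread not yet enabled */
        inlet(&f, 1, 2.5, 7);          /* second input: thread 7 enabled */
        printf("enabled thread: %d\n", f.ready_thread);
        return 0;
    }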
H/W Support for AM (mostly discussion points)
1. Network Interface
goal: reduce the overhead of composing a msg, esp. for small msgs
- DMA transfer for large msgs
- small msgs
- direct communication between processor and network interface
  - compose msgs in the registers of a network interface coprocessor
- memory mapping: memory <---> network interface
- use register set to compose and consume msgs --- reuse of msg data
- support multiple network channels: allow multiple atomic msg compositions at the same
time
- user-level access to the network interface: h/w protection for sending & MMU-based
  protection for receiving
- accelerate frequent msg patterns
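A sketch of msg composition in coprocessor registers with reuse (register names and addresses invented): the fixed header is written once and only the changing payload word is updated per send.

    #include <stdint.h>

    /* Hypothetical coprocessor-style NI register set. */
    #define NI_HDR   ((volatile uint32_t *)0x60000000u)  /* dest + handler */
    #define NI_ARG0  ((volatile uint32_t *)0x60000004u)
    #define NI_ARG1  ((volatile uint32_t *)0x60000008u)
    #define NI_GO    ((volatile uint32_t *)0x6000000Cu)  /* write = launch */

    /* Reuse: compose the fixed part once, then vary one field per msg. */
    static void send_stream(uint32_t dest_handler, const uint32_t *v, int n)
    {
        *NI_HDR  = dest_handler;      /* written once, reused across sends */
        *NI_ARG0 = 0;
        for (int i = 0; i < n; i++) {
            *NI_ARG1 = v[i];          /* only the payload word changes */
            *NI_GO   = 1;             /* atomic launch of the composed msg */
        }
    }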
2. Processor support for msg handlers
goal: efficiency of background computation
- multiplex processor between computation threads and handlers
- traditional: interrupts (flush the pipeline, enter the kernel, crawl out to the user
  handler, trap back to the kernel, return to the interrupted computation)
- fast polling (when msg frequency is high)
- user-level interrupts: problems exist
- PC injection: swap between computation PC and handler PC --- minimal form of
multithreading
- dual processors -- key issue: communication between the two
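A sketch of the fast-polling alternative (am_poll is invented and stubbed out here): a cheap poll is placed at loop boundaries so handler execution is amortized over the computation, instead of paying a full interrupt per message.

    /* Stub standing in for the message layer: in a real system this would
     * check the NI status register and run one pending handler, returning
     * 1 if a message was handled and 0 if the network was empty. */
    static int am_poll(void) { return 0; }

    /* Background computation with polling amortized over iterations. */
    void compute_with_polling(double *a, int n)
    {
        for (int i = 0; i < n; i++) {
            a[i] = a[i] * 2.0 + 1.0;    /* useful work */
            if ((i & 63) == 0)          /* poll every 64 iterations */
                while (am_poll()) ;     /* drain any pending handlers */
        }
    }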
Observation:
- Programming models based on AM depend on usage patterns.
- H/W support for AM depends on both usage patterns and engineering tradeoffs.