Active Messages: a Mechanism for Integrated Communication and Computation
Notes by Yu Zhang, February 19, 1998
High-Level Idea
1. Algorithmic communication model => essential properties of the communication mechanism:
- communication overhead must be low
   - overlap and coordination of communication with ongoing computation
2. Find a communication model compatible with h/w functionality.
Traditional Send/Receive Model
1. Synchronous model
   The send blocks until the corresponding receive is executed; only then
   is the data transferred (3-phase protocol)
pros: simple, no buffer required at the two ends
cons: network latency can't be hidden
2. Asynchronous sending
   The send is non-blocking. The message layer buffers the message until
   the network port is available. The message is buffered at the receiver
   until it is dispatched.
pros: overlapping of computation and communication
   cons: large overhead due to buffer management. Overlapping is only
   effective for large messages.
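The two semantics side by side, sketched with the MPI C API (not part of the paper, but the closest standard interface; the tag values and types here are arbitrary):

    #include <mpi.h>

    /* Contrast of the two traditional send semantics. */
    void demo_sends(int peer, int *data, int n)
    {
        MPI_Request req;

        /* Synchronous send: completes only once the matching receive has
         * started (the 3-phase handshake). No buffering needed, but the
         * network latency is fully exposed to the sender. */
        MPI_Ssend(data, n, MPI_INT, peer, 0, MPI_COMM_WORLD);

        /* Asynchronous send: returns immediately; the message layer may
         * buffer the data, so computation can overlap the transfer. */
        MPI_Isend(data, n, MPI_INT, peer, 1, MPI_COMM_WORLD, &req);
        /* ... overlapped computation goes here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* buffer-management cost surfaces here */
    }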
Active Message (AM) Model
- Each message contains at its head the address of a user-level handler
- The handler is executed on message arrival with the message body as argument. Its role
is to get the message out of the network and into the ongoing computation of the
recipient, i.e. either
   - store the arriving data in a preallocated data structure of the recipient, or
   - in the case of a remote service request, immediately reply to the requester.
- The handler must execute quickly and run to completion.
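A toy sketch of the mechanism in C (all names here are invented): the head of the message is the address of a user-level handler, valid on every node because the code image is uniform; delivery means invoking that handler on the message body.

    #include <stdio.h>

    typedef void (*am_handler_t)(void *node_state, int arg0, int arg1);

    typedef struct {
        am_handler_t handler;    /* message head: user-level handler address */
        int          arg0, arg1; /* message body: handler arguments */
    } active_msg_t;

    /* Example handler: deposit arriving data into a preallocated slot,
     * then finish -- quickly, to completion, without blocking. */
    static void store_handler(void *node_state, int index, int value)
    {
        ((int *)node_state)[index] = value;
    }

    /* What delivery means: on arrival, invoke the named handler on the
     * body, pulling the message straight out of the network. */
    static void am_deliver(void *node_state, active_msg_t *m)
    {
        m->handler(node_state, m->arg0, m->arg1);
    }

    int main(void)
    {
        int table[8] = {0};                      /* preallocated structure */
        active_msg_t m = { store_handler, 3, 42 };
        am_deliver(table, &m);                   /* simulate arrival */
        printf("table[3] = %d\n", table[3]);     /* prints 42 */
        return 0;
    }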
design characteristics:
- Require a uniform code image on all nodes -- the sender specifies the address of the
handler to be invoked at the recipient.
- No need for buffering (except as required for network transport) -- small messages become
  attractive!
- Primitive scheduling: the handlers interrupt the computation immediately upon message
  arrival and execute to completion.
- Deadlock avoidance: a handler cannot block.
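A sketch of the remote-service-request case under these rules (the am_send primitive below is a local stand-in that delivers immediately): the request handler never blocks; it composes a reply at once, and the reply handler deposits the result and raises a flag.

    #include <stdio.h>
    #include <stdint.h>

    typedef struct { int value; volatile int *done; } reply_ctx_t;

    /* Local stand-in for the network: deliver the message in place. */
    static void am_send(int dst, void (*h)(void *, void *), void *a0, void *a1)
    {
        (void)dst;
        h(a0, a1);
    }

    static int shared_counter = 99;   /* per-node state being read remotely */

    static void reply_handler(void *ctx, void *val)
    {
        reply_ctx_t *r = ctx;
        r->value = (int)(intptr_t)val;  /* store into a preallocated slot */
        *r->done = 1;                   /* signal the waiting computation */
    }

    static void request_handler(void *requester, void *ctx)
    {
        /* Reply immediately from within the handler -- no blocking. */
        am_send((int)(intptr_t)requester, reply_handler, ctx,
                (void *)(intptr_t)shared_counter);
    }

    int main(void)
    {
        volatile int done = 0;
        reply_ctx_t ctx = { 0, &done };
        am_send(0, request_handler, (void *)0, &ctx);  /* issue the request */
        while (!done) ;                 /* requester spins on completion flag */
        printf("remote value: %d\n", ctx.value);
        return 0;
    }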
Q: How generally can the requirement of a uniform code image be satisfied?
Q: As to deadlock avoidance, is the condition necessary and sufficient?
Implementation of AM on Message-Passing Architectures
Differences in network and network interface support between the nCUBE/2 and the CM-5, in terms of
- number of disjoint networks
- user-level access to the network interface
- packet size
- packet ordering preservation
- memory-mapped FIFOs
- DMA
dictate different implementations of AM:
- trap to kernel when sending vs. stuff the memory-mapped outgoing FIFO
- interrupts vs. polling for message reception
- packet ordering for free vs. additional set-up on the receiving end or packet header
  information to preserve packet ordering
- additional buffers for deadlock avoidance vs. one-way communication on two disjoint
networks
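A sketch of the first contrast (the addresses, register layout, and status bit are invented; this compiles but targets hypothetical hardware): a CM-5-style user-level send stuffs the memory-mapped outgoing FIFO and retries on failure, with no kernel crossing, whereas the nCUBE/2 path for the same send is a trap into the kernel.

    #include <stdint.h>

    /* Hypothetical memory-mapped network interface registers. */
    #define NI_SEND      ((volatile uint32_t *)0x50000000u)  /* outgoing FIFO */
    #define NI_STATUS    ((volatile uint32_t *)0x50000010u)
    #define NI_SEND_OK   0x1u

    /* User-level send: stuff the FIFO directly, no trap to the kernel. */
    static void cm5_style_send(uint32_t dest, const uint32_t *pkt, int words)
    {
        do {
            *NI_SEND = dest;                  /* route/header word */
            for (int i = 0; i < words; i++)
                *NI_SEND = pkt[i];            /* stuff payload into the FIFO */
        } while (!(*NI_STATUS & NI_SEND_OK)); /* NI may refuse; retry whole pkt */
    }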
Q: From the instruction breakdown of the AM implementation on the nCUBE/2, we see that the
   AM-specific crawl-out accounts for a substantial part of the overhead. Can you figure
   out how to cut down the crawl-out overhead or eliminate it?
Split-C Programming Model
provides split-phase, non-blocking remote memory operations in C.
- PUT
- GET
- completion flag for synchronization
programming pattern:
- Receiver-initiated data transmission via the GET operation
- To overlap communication and computation, the receiver must initiate the data transmission
  early enough before using the data (see the sketch below)
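A sketch of that double-buffered GET pattern (the GET signature is paraphrased, not the exact Split-C primitive, and its body below is a local stand-in; a real split-phase GET issues an active message and returns before the data arrives):

    #include <string.h>

    /* Stand-in for split-phase GET: fetch nbytes from `src` on node
     * `node` into `dst`, then raise the completion flag. */
    static void GET(int node, const void *src, void *dst, int nbytes,
                    volatile int *done)
    {
        (void)node;
        memcpy(dst, src, nbytes);
        *done = 1;
    }

    #define BLK 256

    /* Receiver-initiated, double-buffered consumption of remote data. */
    void consume(int peer, const double *remote, int nblocks)
    {
        static double buf[2][BLK];
        volatile int done[2] = {0, 0};

        GET(peer, remote, buf[0], sizeof buf[0], &done[0]);  /* prefetch */
        for (int i = 0; i < nblocks; i++) {
            int cur = i & 1;
            if (i + 1 < nblocks)        /* initiate the next transfer early */
                GET(peer, remote + (i + 1) * BLK, buf[cur ^ 1],
                    sizeof buf[0], &done[cur ^ 1]);
            while (!done[cur]) ;        /* spin on the completion flag */
            /* ... compute on buf[cur] ... */
            done[cur] = 0;              /* reset the flag for reuse */
        }
    }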
Q: What are the relations and differences among AM, shared memory, and memory-mapped
   communication, in terms of mechanism and programmer responsibility?
Using AM to support languages with dynamic parallelism (countering the case for
message-driven architectures)
1. Message-Driven Architectures
- message: handler + data
- Messages go into a scheduling queue on arrival.
- Once a message reaches the head of the queue, the handler is executed with the data as
arguments
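A toy sketch of message-driven dispatch in C (names invented): arrival only enqueues; the handler, which in the real model may be an arbitrary suspendable computation, runs when the message reaches the head of the queue.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct msg {
        void (*handler)(struct msg *);  /* may suspend in the real model */
        int data;
        struct msg *next;
    } msg_t;

    static msg_t *head, *tail;

    static void arrive(void (*h)(msg_t *), int data)  /* network side */
    {
        msg_t *m = malloc(sizeof *m);   /* queue size/lifetime unbounded */
        m->handler = h; m->data = data; m->next = NULL;
        if (tail) tail->next = m; else head = m;
        tail = m;
    }

    static void print_handler(msg_t *m) { printf("got %d\n", m->data); }

    int main(void)
    {
        arrive(print_handler, 1);
        arrive(print_handler, 2);
        while (head) {                  /* scheduler: run from queue head */
            msg_t *m = head;
            head = m->next;
            if (!head) tail = NULL;
            m->handler(m);
            free(m);
        }
        return 0;
    }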
Message-Driven Architectures vs. Active Messages
- Message-driven role of handler: arbitrary computation, which can suspend. Thus:
  - dynamic allocation: size and lifetime of the scheduling queue are arbitrary
  - complex scheduling
- Active Message role of handler: get the message out of the network and into
  the ongoing computation of the recipient. Thus:
  - buffering only for network transport
  - simple scheduling
Q: Why is h/w support for message-driven architectures counterproductive?
   The message-by-message scheduling inherent in the model results in short computation
   run-lengths. This lack of locality prevents the utilization of large register sets.
2. TAM ---- Simulating Message-Driven Architectures with AM
   TAM (Threaded Abstract Machine), a fine-grain parallel execution model based on AM.
   Compiler-assisted allocation and scheduling:
   - The compiler allocates an activation frame per function invocation; message handlers
     (inlets) deposit arriving data directly into preallocated frame slots.
   - Enabled threads belonging to the same frame run back-to-back, recovering the locality
     that message-by-message scheduling destroys.
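A minimal sketch of a TAM-style inlet expressed as an AM handler (names and frame layout invented): the inlet stores the arriving value into a frame slot and decrements a synchronization counter; when the counter reaches zero, the dependent thread becomes runnable.

    #include <stdio.h>

    typedef struct frame {
        double slot[4];      /* compiler-allocated operand slots */
        int    sync;         /* inputs still outstanding */
        int    ready_thread; /* thread enabled once sync reaches 0 */
    } frame_t;

    /* Inlet: runs on message arrival, to completion, without blocking. */
    static void inlet(frame_t *f, int slot, double value, int thread)
    {
        f->slot[slot] = value;
        if (--f->sync == 0)
            f->ready_thread = thread;  /* post thread; scheduler runs it later */
    }

    int main(void)
    {
        frame_t f = { {0}, 2, -1 };
        inlet(&f, 0, 1.5, 7);          /* first input: thread not yet enabled */
        inlet(&f, 1, 2.5, 7);          /* second input: thread 7 enabled */
        printf("enabled thread: %d\n", f.ready_thread);
        return 0;
    }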
H/W Support for AM (mostly discussion points)
1. Network Interface
goal: reduce the overhead of composing a msg, esp. for small msgs
- DMA transfer for large msgs
- small msgs
- direct communication between processor and network interface
  - compose msgs in the registers of a network interface coprocessor
- memory mapping: memory <---> network interface
- use register set to compose and consume msgs --- reuse of msg data
- support multiple network channels: allow multiple atomic msg compositions at the same
time
- user-level access to the network interface: h/w protection for sending & MMU-based
  protection for receiving
- accelerate frequent msg patterns
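A sketch of msg composition in coprocessor registers with reuse (register names and addresses invented): the fixed header is written once and only the changing payload word is updated per send.

    #include <stdint.h>

    /* Hypothetical coprocessor-style NI register set. */
    #define NI_HDR   ((volatile uint32_t *)0x60000000u)  /* dest + handler */
    #define NI_ARG0  ((volatile uint32_t *)0x60000004u)
    #define NI_ARG1  ((volatile uint32_t *)0x60000008u)
    #define NI_GO    ((volatile uint32_t *)0x6000000Cu)  /* write = launch */

    /* Reuse: compose the fixed part once, then vary one field per msg. */
    static void send_stream(uint32_t dest_handler, const uint32_t *v, int n)
    {
        *NI_HDR  = dest_handler;      /* written once, reused across sends */
        *NI_ARG0 = 0;
        for (int i = 0; i < n; i++) {
            *NI_ARG1 = v[i];          /* only the payload word changes */
            *NI_GO   = 1;             /* atomic launch of the composed msg */
        }
    }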
2. Processor support for msg handlers
goal: efficiency of background computation
- multiplex processor between computation threads and handlers
- traditional: interrupts (flush the pipeline, enter the kernel, crawl out to the user
  handler, trap back to the kernel, return to the interrupted computation)
- fast polling (when msg frequency is high)
- user-level interrupts: problems exist
- PC injection: swap between computation PC and handler PC --- minimal form of
multithreading
- dual processors -- key issue: communication between the two
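A sketch of the fast-polling alternative (am_poll is invented and stubbed out here): a cheap poll is placed at loop boundaries so handler execution is amortized over the computation, instead of paying a full interrupt per message.

    /* Stub standing in for the message layer: in a real system this would
     * check the NI status register and run one pending handler, returning
     * 1 if a message was handled and 0 if the network was empty. */
    static int am_poll(void) { return 0; }

    /* Background computation with polling amortized over iterations. */
    void compute_with_polling(double *a, int n)
    {
        for (int i = 0; i < n; i++) {
            a[i] = a[i] * 2.0 + 1.0;    /* useful work */
            if ((i & 63) == 0)          /* poll every 64 iterations */
                while (am_poll()) ;     /* drain any pending handlers */
        }
    }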
Observation:
- Programming models based on AM depend on usage patterns.
- H/W support for AM depends on both usage patterns and engineering tradeoffs.