10 Heterogeneous Transports
Ensemble provides a flexible infrastructure for sending messages
over a variety of communication transports. Not only can
different groups use different transports, but a single group
can also communicate over multiple transports at the same time.
The design of the transport module is split into three parts:
- The socket module:
  Low-level system calls (send, sendto, recv, etc.)
  implemented in a system-independent fashion. The socket
  directory contains the code. socket/u is a simple-minded
  implementation that uses the OCaml Unix library directly. A more
  efficient version is located in socket/s, where native OS
  io-vector send/recv facilities are used.
- Transports:
  Self-registering transports: DEERING, UDP, TCP, NETSIM. These
  use the low-level socket module calls to provide an abstract transport.
- Routers:
  A router uses a communication transport to
  build Ensemble-specific send/recv capabilities. A length field,
  group id, and endpoint rank are added to each outgoing
  message. Basic parsing is performed on received messages, and the sender
  rank, group, and message length are extracted.
  There are several routers in the route
  subdirectory. signed.ml adds a 16-byte MD5 checksum to
  each outgoing message. An agreed group secret is used to key MD5,
  providing group authentication. Incoming messages are stripped of
  this header and verified. unsigned.ml is the vanilla router.
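The behaviour described for signed.ml can be illustrated with a minimal sketch. This is not the code in signed.ml: the names sign and verify are hypothetical, and we assume that MD5 is keyed by hashing the group secret followed by the payload, which the text does not specify. The standard library Digest module supplies the 16-byte MD5 digest.

```ocaml
(* Hypothetical sketch of a signed router's framing, not Ensemble's signed.ml.
   Assumption: MD5 is keyed by hashing the group secret followed by the payload. *)

let digest_len = 16  (* an MD5 digest is 16 bytes long *)

(* Prepend the keyed 16-byte checksum to an outgoing payload. *)
let sign ~key payload =
  Digest.string (key ^ payload) ^ payload

(* Strip the checksum from an incoming packet and verify it.
   Returns the payload on success, or None if the packet is too short
   or the checksum does not match. *)
let verify ~key packet =
  let n = String.length packet in
  if n < digest_len then None
  else
    let mac = String.sub packet 0 digest_len in
    let payload = String.sub packet digest_len (n - digest_len) in
    if String.equal mac (Digest.string (key ^ payload)) then Some payload
    else None
```

A receiver that holds a different secret rejects the packet, which is the group authentication property the text refers to.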
The user can choose either of the two socket module
implementations. The socket module interface is defined in
socket/socket.mli. The unoptimized socket implementation
(usocket) represents message data as a Caml string and benefits from
native garbage collection. Its disadvantage is reduced
performance. The optimized socket library (ssocket) uses native C
io-vectors, and native operating-system scatter-gather message
send/receive facilities. This provides much better performance, and
zero-copy integration with C applications. The disadvantage is more
difficult integration with native ML values.
The transports are defined in the trans subdirectory:
UDP in trans/udp.ml, TCP in trans/tcp.ml,
DEERING in trans/ipmc, and NETSIM in
trans/netsim.
The route subdirectory contains three routers: signed,
unsigned, and bypass.
10.1 Code walk-through
To give a better understanding of the design, this section walks
through a configuration that uses the unsigned router, the UDP transport,
and the optimized socket library. We start from the bottom
and work our way up.
In file socket/s/sendrecv.c, there is code for sending an
array of C io-vectors and part of an ML string. The function takes
five arguments:
- info_v: a structure describing a list of remote targets and a
  socket through which to send messages
- prefix_v: an ML string that prefixes the data
- ofs_v, len_v: the offset and length of the prefix to send
- iova_v: an array of io-vectors wrapped in an ML representation
value skt_sendtosv(
    value info_v,
    value prefix_v,
    value ofs_v,
    value len_v,
    value iova_v
) {
    int naddr=0, i, ret=0;
    ocaml_skt_t sock=0 ;
    skt_sendto_info_t *info ;

    info = skt_Sendto_info_val(info_v);
    send_msghdr.msg_iovlen = prefixed_gather(prefix_v, ofs_v, len_v, iova_v);
    send_msghdr.msg_namelen = info->addrlen ;
    sock = info->sock ;
    naddr = info->naddr ;

    for (i=0;i<naddr;i++) {
        /* Send the message. Assume we don't block or get interrupted. */
        send_msghdr.msg_name = (char*) &info->sa[i] ;
        ret = sendmsg(sock, &send_msghdr, 0) ;
    }
    return Val_unit;
}
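The body of prefixed_gather is not shown in the listing. As a rough model only, and assuming that it places the selected slice of the prefix first and the io-vectors after it, the gather step can be pictured in OCaml as follows. The segment type and the flatten helper are hypothetical and exist only for illustration.

```ocaml
(* Hypothetical model of the gather step. The real prefixed_gather is C code
   whose body is not shown; this only illustrates the assumed layout. *)
type segment = { buf : Bytes.t; ofs : int; len : int }

(* Build the segment list for one message: the prefix slice first, then every
   io-vector. The second component plays the role of msg_iovlen. *)
let prefixed_gather prefix ofs len iova =
  if ofs < 0 || len < 0 || ofs + len > Bytes.length prefix then
    invalid_arg "prefixed_gather: bad prefix slice";
  let segs = { buf = prefix; ofs; len } :: Array.to_list iova in
  (segs, List.length segs)

(* Copy all segments into one string, for inspection only. A scatter-gather
   send hands the segments to the operating system without this copy. *)
let flatten segs =
  String.concat ""
    (List.map (fun s -> Bytes.sub_string s.buf s.ofs s.len) segs)
```

The point of the optimized socket library is that the flatten step never happens in user code: the operating system reads each segment in place.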
skt_sendtosv is hidden inside the socket library and is exposed
safely as Socket.sendtosv. A sendto_info
structure is created from a sending socket and an array of target
socket addresses.
type sendto_info
val sendto_info : socket -> Unix.sockaddr array -> sendto_info
val sendtosv : sendto_info -> buf -> ofs -> len -> Basic_iov.t array -> unit
The Hsys module makes access to sendtosv safer, and changes its type:
val sendtosv : sendto_info -> Buf.t -> ofs -> len -> Iovecl.t -> unit
(* Implementation *)
Iovec.Priv.sendtosv info
  (Buf.string_of buf) (Buf.int_of_len ofs) (Buf.int_of_len len)
  (Iovecl.to_iovec_array iovl)
Core Ensemble code, including the routers, does not use Socket calls
directly. Rather, it uses the Hsys module, which wraps all calls with a
more type-safe interface. Separate types are used for length, offset,
io-vector, and buffer.
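The benefit of separate types for offsets and lengths can be shown with a small, self-contained sketch. The module Safe and its functions are hypothetical and are not part of Ensemble; they only illustrate the general technique of hiding int behind abstract types.

```ocaml
(* Hypothetical illustration of the technique; these names are not Ensemble's. *)
module Safe : sig
  type ofs                        (* an offset into a buffer *)
  type len                        (* a length in bytes *)
  val ofs_of_int : int -> ofs
  val len_of_int : int -> len
  val sub : string -> ofs -> len -> string
end = struct
  type ofs = int
  type len = int
  let ofs_of_int n = if n < 0 then invalid_arg "ofs_of_int" else n
  let len_of_int n = if n < 0 then invalid_arg "len_of_int" else n
  let sub s ofs len = String.sub s ofs len
end

(* Extract the first six bytes of a buffer. *)
let header =
  Safe.sub "header+data" (Safe.ofs_of_int 0) (Safe.len_of_int 6)
```

Because ofs and len are abstract outside the module, swapping the two arguments of Safe.sub is rejected by the type checker instead of failing at run time.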
The UDP implementation in trans/udp.ml calls Hsys from its
transmit function, x:
let x hdr ofs len iovl =
  Hsys.sendtosv dests hdr ofs len iovl;
  Iovecl.free iovl
The io-vector array is freed after the message is transmitted. The
reference count of an iovec array is decremented on two occasions:
(1) after it is sent on the network, and (2) after it is handed to an
application and the callback has completed. The iovec refcount is
initially set to one when the application sends it, and it is
incremented whenever a copy of it is created. The final
decrement happens when the stability detection protocol determines that all
group members have received the message.
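The counting discipline can be illustrated with a small model. The iovec record and the alloc, copy, and free functions here are hypothetical; Ensemble's Iovec module manages C memory and is implemented differently.

```ocaml
(* Hypothetical reference-counted buffer, for illustration only. *)
type iovec = {
  data : Bytes.t;
  mutable refs : int;        (* number of outstanding references *)
  mutable released : bool;   (* true once the storage has been given up *)
}

(* A fresh buffer starts with a count of one, owned by the sender. *)
let alloc s = { data = Bytes.of_string s; refs = 1; released = false }

(* A logical copy shares the bytes and increments the count. *)
let copy v =
  if v.released then invalid_arg "copy: buffer already released";
  v.refs <- v.refs + 1;
  v

(* Drop one reference; the storage is released when the count reaches zero. *)
let free v =
  if v.released then invalid_arg "free: buffer already released";
  v.refs <- v.refs - 1;
  if v.refs = 0 then v.released <- true
```

In this model, a buffer that is copied once for the network and once for the application is released only after three calls to free: one per copy and one for the original reference.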
10.2 Design of the routers
Many endpoints belonging to different groups can coexist in a single
Ensemble process. Each endpoint is identified by its connection
identifier. The internal representation of this id is given in module
Conn:
type id = {
  version : Version.id ;
  group : Group.id ;
  stack : Stack_id.t ;
  proto : Proto.id option ;
  view_id : View.id option ;
  sndr_mbr : sndr_mbr ;
  dest_mbr : dest_mbr ;
  dest_endpt : dest_endpt option
}
The id is mapped into a string using the Route.pack_of_conn
function. Ensemble uses MD5 for this mapping. The probability of a
collision, i.e., of two different endpoints mapping onto the same
string, is 2^-64, which is sufficient for our purposes.
val pack_of_conn : Conn.id -> Buf.t
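The idea of hashing a structured identifier to a fixed-size string can be sketched with the standard library Digest module, which computes MD5. The conn_id record is a hypothetical, much reduced stand-in for Conn.id, and the serialization format is an assumption made for this sketch, not the one Ensemble uses.

```ocaml
(* Hypothetical, simplified connection id; the real Conn.id has more fields. *)
type conn_id = { group : string; view : int; sndr : int; dest : int }

(* Serialize the fields unambiguously, then hash the result to a fixed
   16-byte key. The serialization format is an assumption of this sketch. *)
let pack_of_conn c =
  Digest.string
    (Printf.sprintf "%S|%d|%d|%d" c.group c.view c.sndr c.dest)
```

Every packed id has the same length regardless of the size of the record, so it can be compared and looked up cheaply when a message arrives.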
The purpose of the route module is to create a single interface to
these various endpoints. The main type exported is
handlers. This is essentially a large array holding the set of
connection identifiers and the delivery function for each of
them. When a message is received by the bottom-most part of the
system, the socket code splits it into an ML header, held as a
string, and the rest of the message, which is received into a
C io-vector. Both parts are then passed to the deliver
function.
val deliver : handlers -> Buf.t -> Buf.ofs -> Buf.len -> Iovecl.t -> unit
deliver takes the current set of handlers and a message, determines
which endpoints need to receive the message, and calls the appropriate
handlers.
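The dispatch step can be modelled with a hash table keyed by the packed connection id. This is a hypothetical model: Route.handlers is a different structure, we assume here that the 16-byte packed id sits at the front of the packet, and this deliver returns the number of handlers invoked (rather than unit) so that the result is easy to inspect.

```ocaml
(* Hypothetical model of a handler table; not Ensemble's Route.handlers. *)
let key_len = 16  (* length of a packed connection id *)

(* Maps a packed connection id to the delivery functions registered for it. *)
type handlers = (string, (string -> unit) list) Hashtbl.t

let create () : handlers = Hashtbl.create 16

(* Register one more delivery function for a packed id. *)
let install (h : handlers) key f =
  let old = Option.value (Hashtbl.find_opt h key) ~default:[] in
  Hashtbl.replace h key (f :: old)

(* Read the packed id from the front of the packet and pass the remainder to
   every handler registered for it. Packets for unknown ids are dropped.
   Returns the number of handlers invoked. *)
let deliver (h : handlers) packet =
  if String.length packet < key_len then 0
  else
    let key = String.sub packet 0 key_len in
    let body = String.sub packet key_len (String.length packet - key_len) in
    match Hashtbl.find_opt h key with
    | None -> 0
    | Some fs ->
      List.iter (fun f -> f body) fs;
      List.length fs
```

Several endpoints in one process can share a single receive path this way, since the lookup alone decides who sees each message.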
A transmission function is abstracted as a type xmitf:
(* transmit an Ensemble packet, this includes the ML part, and a
* user-land iovecl.
*)
type xmitf = Buf.t -> Buf.ofs -> Buf.len -> Iovecl.t -> unit
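To show how a transport hides behind a function of this shape, here is a sketch of an in-memory loopback transport. The types are simplified stand-ins (string for Buf.t, int for Buf.ofs and Buf.len, string list for Iovecl.t), and the loopback function is hypothetical.

```ocaml
(* Simplified stand-ins for Buf.t, Buf.ofs, Buf.len and Iovecl.t (assumptions). *)
type xmitf = string -> int -> int -> string list -> unit

(* A loopback "transport": returns a transmit function together with the
   queue that collects every packet it sends. *)
let loopback () : xmitf * string Queue.t =
  let wire = Queue.create () in
  let xmit hdr ofs len iovl =
    (* The packet is the selected header slice followed by the io-vectors. *)
    Queue.add (String.sub hdr ofs len ^ String.concat "" iovl) wire
  in
  (xmit, wire)
```

Code written against xmitf does not need to know whether the function it holds sends over UDP, TCP, or a test queue such as this one.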
The Route module has an API for creating send/recv
functions for connection ids, and for installing and deleting
such functions. The unsigned router is a simple example that
uses this functionality to build the basic, insecure
router. It defines a function f:
val f : unit ->
(Trans.rank -> Obj.t option -> Trans.seqno -> Iovecl.t -> unit) Route.t
This router lets users send (1) a sender rank, (2) an ML object, (3) a
sequence number, and (4) a user io-vector array. The body of the code
calls Route.create, where it mainly needs to define how
blast and merge are handled: blast describes how to send
messages, and merge describes how to receive a message on behalf of several
connection ids.
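The two roles can be illustrated with a reduced sketch. It is not the code of unsigned.ml: the ML object argument is dropped, the payload is a plain string instead of an io-vector array, the transmit function takes a single string, and the header encoding (two 8-digit hexadecimal fields, for non-negative values only) is an assumption of the sketch.

```ocaml
(* Hypothetical sketch of blast and merge; not Ensemble's unsigned.ml. *)

(* Per-connection function: sender rank, sequence number, payload. *)
type handler = int -> int -> string -> unit

(* merge: turn the handlers of several connection ids into one handler that
   forwards each received message to all of them, in order. *)
let merge (hs : handler array) : handler =
  fun rank seqno payload -> Array.iter (fun h -> h rank seqno payload) hs

(* blast: given a transmit function, build the send function that marshals
   rank and seqno in front of the payload. Assumes non-negative values. *)
let blast (xmit : string -> unit) : handler =
  fun rank seqno payload ->
    xmit (Printf.sprintf "%08x%08x%s" rank seqno payload)

(* Inverse of the header encoding used by blast. *)
let unmarshal packet =
  let field i = int_of_string ("0x" ^ String.sub packet i 8) in
  (field 0, field 8, String.sub packet 16 (String.length packet - 16))
```

Connecting the output of blast to unmarshal and then to a merged handler gives a complete, if trivial, send and receive path through the router.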