7 Native C Ensemble Application Interface (CE)
The C application interface is very similar in design to the ML
interface. It is located in directory ce. It has been
modified from the original ML interface, so as to fit better into
the C language (type-system and native data structures).
There are seven callbacks a C application needs to define in order
to work with Ensemble. These are:
- install(env, ls, vs): called whenever a new view is installed.
- exit(env): called when the member leaves.
- receive_cast(env, origin, num, iovl): called with the origin and
an iovec array (and its length) whenever a multicast message arrives.
- receive_send(env, origin, num, iovl): called with the origin and
an iovec array (and its length) whenever a point-to-point message
arrives.
- flow_block(env, origin, onoff): called whenever there are
flow-control problems and the application should refrain from
sending messages until further notice.
- block(env): called whenever a view change is forthcoming. All
applications are blocked, the old view is stabilized and cleaned,
and way is made for the new view.
- heartbeat(env, time): called every timeout. The timeout is
specified in the jops structure. Timers are not exact: this callback
may be called at inaccurate times, or more often than necessary. If
accuracy is required, the application should check the time argument.
The environment argument, which is the first argument in all seven
callbacks, is registered when a C application interface is created.
The types of the callbacks are as follows:
typedef int ce_rank_t ;
typedef int ce_len_t ;
typedef void *ce_env_t ;
typedef double ce_time_t ;
typedef void (*ce_appl_install_t)(ce_env_t, ce_local_state_t*, ce_view_state_t*);
typedef void (*ce_appl_exit_t)(ce_env_t) ;
typedef void (*ce_appl_receive_cast_t)(ce_env_t, ce_rank_t, int, ce_iovec_array_t) ;
typedef void (*ce_appl_receive_send_t)(ce_env_t, ce_rank_t, int, ce_iovec_array_t) ;
typedef void (*ce_appl_flow_block_t)(ce_env_t, ce_rank_t, ce_bool_t) ;
typedef void (*ce_appl_block_t)(ce_env_t) ;
typedef void (*ce_appl_heartbeat_t)(ce_env_t, ce_time_t) ;
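Because heartbeat timers are inexact, a callback that needs a steady period can compare the time argument with the time it last did work. The following is a minimal sketch, not code from the CE distribution: hb_state_t and rate_limited_heartbeat are hypothetical names, and the time argument is assumed to be expressed in the same unit as the period.

```c
#include <assert.h>
#include <stddef.h>

typedef void  *ce_env_t;    /* as declared above */
typedef double ce_time_t;   /* as declared above */

/* Hypothetical application state, registered as the environment. */
typedef struct hb_state_t {
    ce_time_t period;     /* minimum spacing between work items */
    ce_time_t last_run;   /* time of the last accepted heartbeat */
    int       started;    /* 0 until the first heartbeat is seen */
    int       runs;       /* how many heartbeats did real work */
} hb_state_t;

/* Has the shape of ce_appl_heartbeat_t. Heartbeats that arrive
 * earlier than one period after the last accepted one are ignored;
 * the time argument is the only clock consulted. */
void rate_limited_heartbeat(ce_env_t env, ce_time_t now) {
    hb_state_t *s = (hb_state_t *)env;
    assert(s != NULL);
    if (s->started && now - s->last_run < s->period)
        return;               /* fired too early, skip */
    s->started = 1;
    s->last_run = now;
    s->runs++;                /* periodic work would go here */
}
```

Such a function could be passed as the heartbeat argument of the constructor, with a pointer to an hb_state_t as the environment.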
A ce_appl_intf_t is the type of a C application interface
(cappl). It is created by the constructor ce_create_intf. There is no need for a destructor because Ensemble
frees the interface structure and all related memory after the exit
callback is invoked. An application interface is opaque; it can be
used to create an endpoint and join a group. It cannot be used to
join more than a single group.
typedef struct ce_appl_intf_t ce_appl_intf_t ;
The constructor takes the above handlers as parameters, as well as
an environment variable.
ce_appl_intf_t*
ce_create_intf(
ce_env_t env,
ce_appl_exit_t exit,
ce_appl_install_t install,
ce_appl_flow_block_t flow_block,
ce_appl_block_t block,
ce_appl_receive_cast_t cast,
ce_appl_receive_send_t send,
ce_appl_heartbeat_t heartbeat
);
The initial operation used to initiate a CE application is
ce_Init. It initializes the internal Ensemble data structures, and
processes command line arguments.
void ce_Init(int argc, char **argv) ;
After a C application completes initialization, it should pass control
to the Ensemble main loop via ce_Main_loop.
void ce_Main_loop ();
To join a group, use the ce_Join operation.
void ce_Join(ce_jops_t *ops, ce_appl_intf_t *c_appl) ;
7.1 Group operations
Similarly to the ML interface, the set of supported operations is:
Leave, Cast, Send, Send1, Prompt, Suspect, XferDone, Rekey,
ChangeProtocol, and ChangeProperties. Messages are arrays of
IO-vectors (iovecs), or C memory chunks. The application can
send and receive iovec-arrays.
Multicast an iovec-array to the group.
void ce_Cast(
ce_appl_intf_t *c_appl,
int num,
ce_iovec_array_t iovl
) ;
Send a point-to-point message to a set of group members.
void ce_Send(
ce_appl_intf_t *c_appl,
int num_dests,
ce_rank_array_t dests,
int num,
ce_iovec_array_t iovl
) ;
Send a point-to-point message to the specified group member.
void ce_Send1(
ce_appl_intf_t *c_appl,
ce_rank_t dest,
int num,
ce_iovec_array_t iovl
) ;
The control actions are the same as the ML actions.
Leave a group. Following this downcall, exit will be called,
freeing the cappl.
void ce_Leave(ce_appl_intf_t *c_appl) ;
Ask for a new View.
void ce_Prompt(
ce_appl_intf_t *c_appl
);
Report specified group members as failure-suspected.
void ce_Suspect(
ce_appl_intf_t *c_appl,
int num,
ce_rank_array_t suspects
);
Inform Ensemble that the state-transfer is complete.
void ce_XferDone(
ce_appl_intf_t *c_appl
) ;
Ask the system to rekey.
void ce_Rekey(
ce_appl_intf_t *c_appl
) ;
Request a protocol change. The protocol_name is a string
specifying the exact set of layers to use. The string is a colon
separated list of layers, for example:
Top:Heal:Switch:Leave:Inter:Intra:Elect:Merge:Sync:Suspect:Stable:
Vsync:Frag_Abv:Top_appl:Frag:Pt2ptw:Mflow:Pt2pt:Mnak:Bottom
void ce_ChangeProtocol(
ce_appl_intf_t *c_appl,
char *protocol_name
) ;
Request a protocol change, specifying properties.
properties is a string containing a colon separated list of
properties. For example:
"Gmp:Sync:Heal:Switch:Frag:Suspect:Flow:Xfer".
The system deduces a protocol stack that abides by these properties.
void ce_ChangeProperties(
ce_appl_intf_t *c_appl,
char *properties
) ;
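Both ce_ChangeProtocol and ce_ChangeProperties take a colon-separated string. When the list of layers or properties is assembled at run time, a small helper can build that string. The sketch below is an illustration only; join_colon is a hypothetical helper, not part of the CE interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Join n names with ':' into out, whose capacity cap includes the
 * terminating NUL. Returns the resulting string length, or -1 if
 * the result would not fit. */
int join_colon(const char *const *names, int n, char *out, size_t cap) {
    size_t used = 0;
    assert(out != NULL && cap > 0);
    out[0] = '\0';
    for (int i = 0; i < n; i++) {
        size_t len = strlen(names[i]);
        size_t need = len + (i > 0 ? 1 : 0);   /* name plus separator */
        if (used + need + 1 > cap)
            return -1;
        if (i > 0)
            out[used++] = ':';
        memcpy(out + used, names[i], len);
        used += len;
        out[used] = '\0';
    }
    return (int)used;
}
```

The resulting buffer could then be passed as the properties or protocol_name argument.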
7.2 Integration of other sockets into the main loop
Ensemble works in an event-driven fashion, where events can come
either from the network or from the user. The system runs a loop that is
split between (1) waiting for input on incoming sockets using a
select system call, and (2) processing local
application send/recv and internal events.
The application hands over control to Ensemble after initialization.
The application may wish to wait on its own sockets, e.g., stdin (on
Unix). To this end, we also support adding, removing, and putting
handlers on sockets.
ce_handler_t is the type of handler called when there is input
to process on a socket.
typedef void (*ce_handler_t)(void*);
ce_AddSockRecv adds a socket to the list Ensemble listens to.
When input arrives on the socket, the handler is invoked
with the specified environment variable.
void ce_AddSockRecv(
CE_SOCKET socket,
ce_handler_t handler,
ce_env_t env
);
ce_RmvSockRecv is called to remove a socket from the list
Ensemble listens to.
void ce_RmvSockRecv(
CE_SOCKET socket
);
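To make the add/remove/dispatch idea concrete, here is a minimal sketch of a select-based handler table. It is not Ensemble's implementation: add_sock_recv, rmv_sock_recv, poll_once, and on_input are hypothetical stand-ins, sockets are assumed to be plain Unix file descriptors, and the table size is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

typedef void (*handler_t)(void *);    /* same shape as ce_handler_t */

#define MAX_SOCKS 16
struct sock_entry { int used; int fd; handler_t handler; void *env; };
static struct sock_entry table[MAX_SOCKS];

/* Register a handler for fd. Returns 0, or -1 if the table is full. */
int add_sock_recv(int fd, handler_t handler, void *env) {
    assert(fd >= 0 && fd < FD_SETSIZE && handler != NULL);
    for (int i = 0; i < MAX_SOCKS; i++) {
        if (!table[i].used) {
            table[i].used = 1;
            table[i].fd = fd;
            table[i].handler = handler;
            table[i].env = env;
            return 0;
        }
    }
    return -1;
}

/* Stop listening to fd. */
void rmv_sock_recv(int fd) {
    for (int i = 0; i < MAX_SOCKS; i++)
        if (table[i].used && table[i].fd == fd)
            table[i].used = 0;
}

/* One loop iteration: wait up to timeout_ms for input, then call the
 * handler of every readable descriptor. Returns the number of
 * handlers called, 0 on timeout, or -1 if select fails. */
int poll_once(int timeout_ms) {
    fd_set rd;
    struct timeval tv;
    int maxfd = -1, called = 0;

    FD_ZERO(&rd);
    for (int i = 0; i < MAX_SOCKS; i++) {
        if (table[i].used) {
            FD_SET(table[i].fd, &rd);
            if (table[i].fd > maxfd)
                maxfd = table[i].fd;
        }
    }
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    if (select(maxfd + 1, &rd, NULL, NULL, &tv) < 0)
        return -1;
    for (int i = 0; i < MAX_SOCKS; i++) {
        if (table[i].used && FD_ISSET(table[i].fd, &rd)) {
            table[i].handler(table[i].env);
            called++;
        }
    }
    return called;
}

/* Example handler: consume one byte and count the event. */
struct reader { int fd; int hits; };
void on_input(void *env) {
    struct reader *r = env;
    char c;
    if (read(r->fd, &c, 1) == 1)
        r->hits++;
}
```

The environment pointer carries whatever state the handler needs, which is why the handler type takes a single void* argument.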
7.3 Memory management
The convention used throughout is that all
data-structures passed from C to ML are consumed by ML, and all
data-structures passed from ML to C are owned by the C side (hence
must be freed). This rule holds for all structures and data apart from
the iovec-arrays.
Ensemble does not copy messages from C to the ML heap, rather, it
separates C-memory and ML memory completely. Messages are received
from the network and read directly into C-buffers. Sent iovecs are
fragmented and sent directly on the network. Messages must be buffered
until all group members reliably receive them. To this end, a
reference counting scheme is used to track iovec liveness. When an
iovec's reference count reaches zero, it is freed. In other words,
iovecs are owned by Ensemble. They are received either from the
user or from the network.
On Linux, the type of an iovec is:
typedef struct iovec ce_iovec_t ;
typedef ce_iovec_t *ce_iovec_array_t;
To get better control of the iovec memory system, the alloc and
free functions can be set by the user. The definitions are in
lib/mm.h.
These define the types of alloc and free functions.
typedef void* (*mm_alloc_t)(int);
typedef void (*mm_free_t)(char*);
The actual functions called to free and allocate iovecs.
mm_alloc_t mm_alloc_fun;
mm_free_t mm_free_fun;
Use these functions to set alloc and free. Be careful to
do this exactly once at application initialization, before
starting Ensemble.
void set_alloc_fun(mm_alloc_t f);
void set_free_fun(mm_free_t f);
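As an illustration of what a user-supplied pair of functions might look like, the sketch below wraps malloc and free with a counter of live chunks, which can help detect leaks during testing. The names counting_alloc, counting_free, and live_chunks are hypothetical; only the two function-pointer types come from the definitions above.

```c
#include <assert.h>
#include <stdlib.h>

typedef void *(*mm_alloc_t)(int);     /* as quoted above */
typedef void  (*mm_free_t)(char *);   /* as quoted above */

static long live_chunks;   /* chunks allocated and not yet freed */

/* Allocation function matching mm_alloc_t. */
void *counting_alloc(int size) {
    void *p;
    assert(size >= 0);
    p = malloc((size_t)size);
    if (p != NULL)
        live_chunks++;
    return p;
}

/* Free function matching mm_free_t. */
void counting_free(char *p) {
    if (p != NULL) {
        live_chunks--;
        free(p);
    }
}
```

With the library linked in, the pair would be installed once at initialization, for example with set_alloc_fun(counting_alloc) followed by set_free_fun(counting_free).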
The upshot of this is that when a user sends or casts a message,
Ensemble takes over the message body. When a message is
delivered to the application, the user may copy it, or perform any
read-only operation while in the receive callback. The application may
not modify a received iovec, or assume it owns it.
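Since a received iovec-array may only be read, an application that wants to keep a message beyond the receive callback has to copy it. A minimal sketch of such a copy follows, assuming the Linux definition of struct iovec; iovl_total_len and iovl_copy_flat are hypothetical helpers, not CE functions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

/* Total number of payload bytes in an iovec array. */
size_t iovl_total_len(const struct iovec *iovl, int num) {
    size_t total = 0;
    for (int i = 0; i < num; i++)
        total += iovl[i].iov_len;
    return total;
}

/* Copy all chunks into one freshly allocated, NUL-terminated buffer.
 * The source array is only read. Returns NULL if allocation fails;
 * otherwise the caller owns the result and must free() it. */
char *iovl_copy_flat(const struct iovec *iovl, int num, size_t *len_out) {
    size_t total = iovl_total_len(iovl, num);
    size_t off = 0;
    char *buf = malloc(total + 1);

    if (buf == NULL)
        return NULL;
    for (int i = 0; i < num; i++) {
        if (iovl[i].iov_len > 0) {
            memcpy(buf + off, iovl[i].iov_base, iovl[i].iov_len);
            off += iovl[i].iov_len;
        }
    }
    buf[total] = '\0';
    if (len_out != NULL)
        *len_out = total;
    return buf;
}
```

The copy belongs to the application, so it can be modified or kept after the callback returns, while the original iovecs stay under Ensemble's control.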
7.4 The flat interface
Using iovecs is a little complex for simple applications;
therefore, a simplified ``flat'' interface exists.
The flat_receive callbacks take a C memory chunk and its length as
arguments. This relieves the application of merging together the
set of buffers that make up an iovec-array, as well as of releasing that
array.
typedef void (*ce_appl_flat_receive_cast_t)(ce_env_t, ce_rank_t, ce_len_t, ce_data_t) ;
typedef void (*ce_appl_flat_receive_send_t)(ce_env_t, ce_rank_t, ce_len_t, ce_data_t) ;
Create a standard application interface using flat receive callbacks.
ce_appl_intf_t*
ce_create_flat_intf(
ce_env_t env,
ce_appl_exit_t exit,
ce_appl_install_t install,
ce_appl_flow_block_t flow_block,
ce_appl_block_t block,
ce_appl_flat_receive_cast_t cast,
ce_appl_flat_receive_send_t send,
ce_appl_heartbeat_t heartbeat
);
Cast and Send operations that work with buffers instead of iovec-arrays.
void ce_flat_Cast(
ce_appl_intf_t *c_appl,
ce_len_t len,
ce_data_t buf
) ;
void ce_flat_Send(
ce_appl_intf_t *c_appl,
int num_dests,
ce_rank_array_t dests,
ce_len_t len,
ce_data_t buf
) ;
void ce_flat_Send1(
ce_appl_intf_t *c_appl,
ce_rank_t dest,
ce_len_t len,
ce_data_t buf
) ;
7.5 An example
This section shows how to use the CE interface to write applications.
We walk through the ce/ce_mtalk.c demo program.
ce/ce_mtalk.c, similarly to demo/mtalk.ml,
is a multi-person talk program. Messages are read from the user via stdin, and multicast to the network.
state_t is the state structure used by the program. It is the
environment variable registered in the C-interface. The state contains
the current view information, a pointer to its cappl, and a flag
indicating if we are blocked.
typedef struct state_t {
    ce_local_state_t *ls;
    ce_view_state_t *vs;
    ce_appl_intf_t *intf;
    int blocked;
} state_t;
A helper function to multicast a message if we are not blocked.
We use the flat interface, to save the messy handling of iovecs.
void cast(state_t *s, char *msg) {
    if (s->blocked == 0)
        ce_flat_Cast(s->intf, strlen(msg), msg);
}
A handler for stdin. This callback is called whenever there is input
on the socket. The handler multicasts any message the user types on the
screen. Be careful not to send messages if we are blocked.
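void stdin_handler(void *env) {
    state_t *s = (state_t*)env;
    char buf[100], *tmp;
    int len;

    /* Nothing to send on end-of-file or a read error. */
    if (fgets(buf, sizeof(buf), stdin) == NULL)
        return;
    len = strlen(buf);
    /* string too long, dumping it: a full buffer with no trailing
     * newline means fgets truncated the line.
     */
    if (len == (int)sizeof(buf) - 1 && buf[len-1] != '\n')
        return;
    tmp = ce_copy_string(buf);
    TRACE2("Read %s:", tmp);
    cast(s, tmp);
}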
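The handler above reads into a fixed 100-byte buffer. Since fgets stores at most one byte less than the buffer size, an over-long line shows up as a full buffer without a trailing newline, not as a longer string. The following sketch shows one way to detect that case and discard the rest of the line; read_line is a hypothetical helper and is not part of ce_mtalk.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read one line from fp into buf (capacity cap) and strip the
 * trailing newline. Returns the line length, -1 on end-of-file or
 * error, or -2 if the line did not fit, in which case the rest of
 * the line is discarded. */
int read_line(FILE *fp, char *buf, size_t cap) {
    size_t len;
    int c;

    assert(fp != NULL && buf != NULL && cap >= 2);
    if (fgets(buf, (int)cap, fp) == NULL)
        return -1;
    len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';
        return (int)(len - 1);
    }
    /* No newline stored: either the line filled the buffer exactly,
     * the input ended without a newline, or the line is too long. */
    c = getc(fp);
    if (c == EOF || c == '\n')
        return (int)len;
    while ((c = getc(fp)) != EOF && c != '\n')
        ;                     /* skip the remainder of the line */
    return -2;
}
```

Discarding the remainder keeps the next call aligned with the start of the following line.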
There is nothing special to do if we leave the group; the application
essentially halts.
void main_exit(void *env) {}
When a new view arrives, update the environment structure. Do not
forget to free the old view structure.
void main_install(void *env, ce_local_state_t *ls, ce_view_state_t *vs) {
    state_t *s = (state_t*) env;
    ce_view_full_free(s->ls,s->vs);
    s->ls = ls;
    s->vs = vs;
    s->blocked = 0;
    printf("%s nmembers=%d", ls->endpt, ls->nmembers);
}
Ignore flow control problems. We are not supposed to have any of
these, since we are very low bandwidth.
void main_flow_block(void *env, ce_rank_t rank, ce_bool_t onoff) {}
Mark our blocked flag.
void main_block(void *env) {
    state_t *s = (state_t*) env;
    s->blocked = 1;
}
Print out any message that we receive. Be careful not to free the
received message.
void main_recv_cast(void *env, int rank, ce_len_t len, char *msg) {
    state_t *s = (state_t*) env;
    printf("recv_cast <- %d msg=%s", rank, msg);
}
Ignore send messages; we are not supposed to get any of these.
void main_recv_send(void *env, int rank, ce_len_t len, char *msg) {
}
Ignore heartbeats.
void main_heartbeat(void *env, double time) { }
Create a join options structure, and join the group ``ce_mtalk''.
Use a regular virtually-synchronous stack. Put a handler on stdin such that whenever there is input, it will be called.
There is no need to set the transport in the join-options structure;
the system uses the environment variable ENS_MODES in this case.
void join() {
    ce_jops_t *jops;
    ce_appl_intf_t *main_intf;
    state_t *s;

    /* The rest of the fields should be zero. The
     * conversion code should be able to handle this.
     */
    jops = record_create(ce_jops_t*, jops);
    record_clear(jops);
    jops->hrtbt_rate = 10.0;
    // jops->transports = ce_copy_string("UDP");
    jops->group_name = ce_copy_string("ce_mtalk");
    jops->properties = ce_copy_string(CE_DEFAULT_PROPERTIES);
    jops->use_properties = 1;
    s = (state_t*) record_create(state_t*, s);
    record_clear(s);
    main_intf = ce_create_flat_intf(s,
        main_exit, main_install, main_flow_block,
        main_block, main_recv_cast, main_recv_send,
        main_heartbeat);
    s->intf = main_intf;
    /* ce_Join takes a ce_jops_t*, so pass jops itself. */
    ce_Join(jops, main_intf);
    ce_AddSockRecv(0, stdin_handler, s);
}
The main entry point: initialize the ML side, process command line
arguments, join the ce_mtalk group, and turn control over
to the Ensemble event loop.
int main(int argc, char **argv) {
    ce_Init(argc, argv); /* Call Arge.parse, and appl_process_args */
    join();
    ce_Main_loop();
    return 0;
}
7.6 Outboard mode
It is possible to run any CE application through a remote Ensemble
server. Such a configuration is called an ``outboard'' configuration.
The idea is to run a daemon on the local host that listens for
TCP connections on a specific port. The daemon provides Ensemble
services to connected clients. Such services include joining/leaving groups,
and sending/receiving multicast and point-to-point
messages on these groups.
A CE application can be configured to run in outboard mode by linking
with the libceo library (suffix .a on Unix, .lib
on WIN32). The user must then make sure that the Ensemble daemon is
running; to start it, simply run the ce_outboard executable.
Using a daemon configuration has several benefits as well as some
drawbacks. The advantages are:
- The library to link with is orders
of magnitude smaller than the full (inboard) Ensemble library.
- The user-process is completely separated from the Ensemble
server. This allows better debugging, and also facilitates writing simple
interfaces to other languages (e.g., Java, Ada, ...).
The disadvantage is performance loss. Each message now has to travel
through a socket and another process before being sent on the network;
vice-versa for received messages. This may outweigh the benefits of
simple client code, and a minimal sized library.
The current port used by the outboard mode is 5002. This is
configurable by running ce_outboard with the command line
argument -tcp_port <port_num>, and modifying the
OUTBOARD_TCP_PORT parameter in ce/ce_outboard_comm.h.
Care was taken to optimize memory consumption. Messages are sent
zero-copy from the client, and they are copied once only into the
server's buffers. A sent io-vector is consumed by the send
function. Received messages are allocated at the client's buffers and
handed to the application. After the application's receive callback,
io-vectors are released. It would have been possible at this point to allow the
application to take control of the io-vector, yet we chose to conform
to the memory conventions of the inboard mode.
7.7 Thread-safety
A thread-safe version of the library is also provided; it exports the
exact same interface as the basic library. To use it, link with libce_mt.so or libceo_mt.so. On WIN32 systems, link with
the corresponding .lib files instead. The thread-safe library requires the application
to synchronize its threads so they will not perform actions (send,
cast, prompt, etc.) on a group while it is stabilizing. There are
several thread-safe applications under the ce directory: ce_rand_mt.c, ce_perf_mt.c, and ce_mtalk_mt.c. These applications
use a lock to ensure that sensitive group-state is accessed safely.
Threads atomically check group-state before performing an Ensemble
action.
The thread-safe library is designed as a wrapper around the basic
library. A single thread runs both Ensemble main-loop and application
callback handlers; this thread is known as the Ensemble
thread. Other threads are referred to as user-threads. When a
user-thread performs an action outside of a handler, the action is
stored in a pending queue. A byte is sent through a socket to the
Ensemble thread, notifying it that there is pending work to do.
Asynchronously, the Ensemble thread ``wakes up'', consumes the queue,
and performs all pending actions. Any actions invoked in the interim
will also be stored in the pending queue; to be consumed along with
the rest.
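The pending-queue mechanism can be sketched as follows. This is an illustration of the technique only, not the library's code: the names are hypothetical, a pipe stands in for the internal socket, and the queue has a fixed, arbitrary capacity.

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

typedef void (*action_fn)(void *);
struct action { action_fn fn; void *arg; };

#define QCAP 64
static struct action pending[QCAP];
static int npending;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static int wake_fd[2];   /* [0]: read by the loop thread, [1]: written by users */

/* Create the wake-up channel. Returns 0 on success, -1 on failure. */
int pending_init(void) {
    if (pipe(wake_fd) != 0)
        return -1;
    return fcntl(wake_fd[0], F_SETFL, O_NONBLOCK) == -1 ? -1 : 0;
}

/* User thread: queue an action, then send one byte to wake the loop
 * thread. Returns 0 on success, -1 if the queue is full or the
 * wake-up byte could not be written. */
int pending_post(action_fn fn, void *arg) {
    char byte = 1;
    int queued = 0;

    pthread_mutex_lock(&qlock);
    if (npending < QCAP) {
        pending[npending].fn = fn;
        pending[npending].arg = arg;
        npending++;
        queued = 1;
    }
    pthread_mutex_unlock(&qlock);
    if (!queued)
        return -1;
    return write(wake_fd[1], &byte, 1) == 1 ? 0 : -1;
}

/* Loop thread: swallow wake-up bytes, take the whole queue, and run
 * the actions outside the lock. Returns the number of actions run. */
int pending_drain(void) {
    struct action batch[QCAP];
    char junk[QCAP];
    int n;
    ssize_t got = read(wake_fd[0], junk, sizeof junk);

    (void)got;   /* an empty pipe is fine; the queue decides the work */
    pthread_mutex_lock(&qlock);
    n = npending;
    for (int i = 0; i < n; i++)
        batch[i] = pending[i];
    npending = 0;
    pthread_mutex_unlock(&qlock);
    for (int i = 0; i < n; i++)
        batch[i].fn(batch[i].arg);
    return n;
}

/* Example action: increment a counter. */
void bump(void *arg) { ++*(int *)arg; }
```

Actions posted while a drain is in progress are either picked up by that drain or left queued for the next one; a later wake-up byte that finds the queue empty is harmless.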
Any action invoked from within a callback is performed directly when
the callback is completed and control returns to Ensemble.
Since a single thread performs the Ensemble main-loop as well as all user
callbacks, callbacks must be short. Long-term computations should not be performed in the context of a callback.
There are three sensitive periods in which issuing Ensemble actions is
not allowed: joining, leaving, and blocking. A group is in:
- joining state: between ce_Join and
the first install callback.
- leaving state: between ce_Leave and the exit
callback.
- blocking state: between the block callback and the
succeeding install callback.
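The three periods listed above can be tracked with a few flags that are set around the downcalls and cleared in the callbacks. The sketch below is a hypothetical helper (group_gate_t and the gate_* functions are not part of CE); in a multi-threaded program every access would additionally be protected by a lock.

```c
#include <assert.h>
#include <stddef.h>

typedef struct group_gate_t {
    int joining;   /* set before ce_Join, cleared by the first install */
    int leaving;   /* set before ce_Leave, never cleared afterwards */
    int blocked;   /* set by block, cleared by the next install */
} group_gate_t;

void gate_on_join(group_gate_t *g)  { g->joining = 1; }
void gate_on_block(group_gate_t *g) { g->blocked = 1; }
void gate_on_leave(group_gate_t *g) { g->leaving = 1; }

/* An install callback ends both the joining and the blocking state. */
void gate_on_install(group_gate_t *g) {
    g->joining = 0;
    g->blocked = 0;
}

/* Nonzero if Ensemble actions may be issued right now. */
int gate_may_act(const group_gate_t *g) {
    assert(g != NULL);
    return !(g->joining || g->leaving || g->blocked);
}
```

The leaving flag is deliberately never reset, because the interface is freed once the exit callback has run.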
An example of a simple multi-threaded application is provided in ce/ce_mtalk_mt.c.
The overhead of adding thread-safety is 10% in the worst case, and
normally much less than that. This should be acceptable for most
applications.
7.8 A multi-threaded multi-person chat program
This program is a multi-threaded version of ce_mtalk.c.
Here, we walk through it and explain the interface and how to
use it.
Include the system-independent thread header file, so we'll be
able to use locks.
#include "ce_trace.h"
#include "ce.h"
#include "ce_threads.h"
#include <stdio.h>
#include <memory.h>
#include <malloc.h>
The NAME variable is used for internal tracing purposes of
CE. There is no need to set it for standard user programs.
#define NAME "CE_MTALK_MT"
Apart from the standard view state, the state structure keeps track
of the current status of the group: blocked, joining, or leaving.
typedef struct state_t {
    ce_local_state_t *ls;
    ce_view_state_t *vs;
    ce_appl_intf_t *intf;
    int blocked;
    int joining;
    int leaving;
    ce_lck_t *mutex;
} state_t;
Although we must define these callbacks, they do nothing in this
program.
void main_exit(void *env)
{}
void
main_flow_block(void *env, ce_rank_t rank, ce_bool_t onoff)
{}
void
main_recv_send(void *env, int rank, ce_len_t len, char *msg)
{}
void
main_heartbeat(void *env, double time)
{}
main_install updates the view state. A lock must be taken to
protect view state, as other threads may concurrently read the state.
void
main_install(void *env, ce_local_state_t *ls, ce_view_state_t *vs)
{
    state_t *s = (state_t*) env;
    ce_lck_Lock(s->mutex); {
        ce_view_full_free(s->ls,s->vs);
        s->ls = ls;
        s->vs = vs;
        s->blocked = 0;
        s->joining = 0;
        printf("%s nmembers=%d", ls->endpt, ls->nmembers);
        TRACE2("main_install",s->ls->endpt);
    } ce_lck_Unlock(s->mutex);
}
The group is blocked, lock the state structure, and update the blocked
flag. This notifies other threads not to attempt sending messages
until the upcoming install callback. A lock must be taken to protect view
state, as other threads may read it.
void
main_block(void *env)
{
    state_t *s = (state_t*) env;
    ce_lck_Lock(s->mutex); {
        s->blocked = 1;
    } ce_lck_Unlock(s->mutex);
}
Received a message, print who sent it and its content.
void
main_recv_cast(void *env, int rank, ce_len_t len, char *msg)
{
    printf("%d -> msg=%s", rank, msg);
    fflush(stdout);
}
get_input is a non-terminating function performed by the user-thread of
this program. In an infinite loop, read a line from stdin,
and multicast it to the group. Prior to sending, check that the group is not
blocked/joining/leaving. Status flags are shared information, and
may be updated concurrently by an install or block
callback. Hence, a lock is taken to protect access to the flags.
void
get_input(void *env)
{
    state_t *s = (state_t*)env;
    char buf[100], *msg;
    int len;

    while (1) {
        TRACE("stdin_handler");
        /* Stop reading on end-of-file or a read error. */
        if (fgets(buf, sizeof(buf), stdin) == NULL)
            return;
        len = strlen(buf);
        /* string too long, dumping it: a full buffer with no
         * trailing newline means fgets truncated the line.
         */
        if (len == (int)sizeof(buf) - 1 && buf[len-1] != '\n')
            return;
        msg = ce_copy_string(buf);
        TRACE2("Read: ", msg);
        ce_lck_Lock(s->mutex); {
            if (s->joining || s->leaving || s->blocked)
                printf("Cannot send while group is joining/leaving/blocked");
            else {
                ce_flat_Cast(s->intf, strlen(msg), msg);
            }
        } ce_lck_Unlock(s->mutex);
    }
}
Initialize the state structure, and join the ``ce_mtalk'' Ensemble group.
Take care to initialize the lock, and set the joining flag. The flag
will be unset, allowing sending messages, in the first install callback.
state_t *
join(void)
{
    ce_jops_t *jops;
    ce_appl_intf_t *main_intf;
    state_t *s;

    /* The rest of the fields should be zero. The
     * conversion code should be able to handle this.
     */
    jops = record_create(ce_jops_t*, jops);
    record_clear(jops);
    jops->hrtbt_rate = 10.0;
    jops->transports = ce_copy_string("DEERING");
    jops->group_name = ce_copy_string("ce_mtalk");
    jops->properties = ce_copy_string(CE_DEFAULT_PROPERTIES);
    jops->use_properties = 1;
    s = (state_t*) record_create(state_t*, s);
    record_clear(s);
    main_intf = ce_create_flat_intf(s,
        main_exit, main_install, main_flow_block,
        main_block, main_recv_cast, main_recv_send,
        main_heartbeat);
    s->intf = main_intf;
    s->mutex = ce_lck_Create();
    s->joining = 1;
    ce_Join(jops, main_intf);
    return s;
}
Initialize Ensemble, start the reader thread, and go to sleep.
int
main(int argc, char **argv)
{
    state_t *s;

    ce_Init(argc, argv); /* Call Arge.parse, and appl_process_args */
    /* Join the group
     */
    s = join();
    /* Create a thread to read input from the user.
     */
    ce_thread_Create(get_input, s, 10000);
    ce_Main_loop();
    return 0;
}
7.9 Notes
Of the four transports supported by Ensemble (NETSIM, UDP, TCP, and
DEERING), NETSIM is not supported by the thread-safe library. A socket
is used internally, and NETSIM does not allow any
external communication.