
7   Native C Ensemble Application Interface (CE)

The C application interface is very similar in design to the ML interface. It is located in directory ce. It has been modified from the original ML interface, so as to fit better into the C language (type-system and native data structures).

A C application must define seven callbacks in order to work with Ensemble. These are: install, exit, receive_cast, receive_send, flow_block, block, and heartbeat.

The environment argument, which is the first argument of all seven callbacks, is registered when a C application interface is created.

The types of the callbacks are as follows:

typedef int         ce_rank_t ;
typedef int         ce_len_t ;
typedef void       *ce_env_t ;
typedef double      ce_time_t ;

typedef void (*ce_appl_install_t)(ce_env_t, ce_local_state_t*, ce_view_state_t*);

typedef void (*ce_appl_exit_t)(ce_env_t) ;

typedef void (*ce_appl_receive_cast_t)(ce_env_t, ce_rank_t, int, ce_iovec_array_t) ;

typedef void (*ce_appl_receive_send_t)(ce_env_t, ce_rank_t, int, ce_iovec_array_t) ;

typedef void (*ce_appl_flow_block_t)(ce_env_t, ce_rank_t, ce_bool_t) ;

typedef void (*ce_appl_block_t)(ce_env_t) ;

typedef void (*ce_appl_heartbeat_t)(ce_env_t, ce_time_t) ;

A ce_appl_intf_t is the type of a C application interface (cappl). It is created by the constructor ce_create_intf. There is no need for a destructor, because Ensemble frees the interface structure and all related memory after the exit callback is invoked. An application interface is opaque; it can be used to create an endpoint and join a group. It cannot be used to join more than a single group.

typedef struct ce_appl_intf_t ce_appl_intf_t ;

The constructor takes the above handlers as parameters, as well as an environment variable.


ce_appl_intf_t*
ce_create_intf(
    ce_env_t env, 
    ce_appl_exit_t exit,
    ce_appl_install_t install,
    ce_appl_flow_block_t flow_block,
    ce_appl_block_t block,
    ce_appl_receive_cast_t cast,
    ce_appl_receive_send_t send,
    ce_appl_heartbeat_t heartbeat
);
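The environment mechanism can be illustrated with a small stand-alone sketch. It does not use ce.h: the typedefs copy the shapes given above, while demo_intf_t, demo_create_intf, the demo_deliver_* functions, and app_state_t are hypothetical names invented here to show how a value registered at creation time comes back as the first argument of every callback.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in declarations so the sketch compiles without ce.h.
 * The typedef shapes follow the manual; everything named demo_* or
 * app_* is hypothetical. */
typedef void  *ce_env_t;
typedef double ce_time_t;
typedef void (*ce_appl_block_t)(ce_env_t);
typedef void (*ce_appl_heartbeat_t)(ce_env_t, ce_time_t);

/* Miniature interface: stores the environment once and hands it back
 * on every callback. */
typedef struct demo_intf_t {
    ce_env_t            env;
    ce_appl_block_t     block;
    ce_appl_heartbeat_t heartbeat;
} demo_intf_t;

/* Application state, reached only through the environment pointer. */
typedef struct app_state_t {
    int       blocked;
    int       beats;
    ce_time_t last_beat;
} app_state_t;

void app_block(ce_env_t env)
{
    app_state_t *s = (app_state_t *)env;
    s->blocked = 1;
}

void app_heartbeat(ce_env_t env, ce_time_t now)
{
    app_state_t *s = (app_state_t *)env;
    s->beats++;
    s->last_beat = now;
}

/* Register the environment together with the handlers. */
demo_intf_t demo_create_intf(ce_env_t env, ce_appl_block_t block,
                             ce_appl_heartbeat_t heartbeat)
{
    demo_intf_t intf;
    assert(block != NULL && heartbeat != NULL);
    intf.env = env;
    intf.block = block;
    intf.heartbeat = heartbeat;
    return intf;
}

/* What the library side would do when an event arrives. */
void demo_deliver_block(const demo_intf_t *intf)
{
    intf->block(intf->env);
}

void demo_deliver_heartbeat(const demo_intf_t *intf, ce_time_t now)
{
    intf->heartbeat(intf->env, now);
}
```

Because the environment is an untyped pointer, each callback casts it back to the application's own state type; the library never inspects it.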

The initial operation used to initiate a CE application is ce_Init. It initializes the internal Ensemble data structures, and processes command line arguments.


void ce_Init(int argc, char **argv) ;

After a C application completes initialization, it should pass control to the Ensemble main loop via ce_Main_loop.


void ce_Main_loop ();

To join a group, use the ce_Join operation.

void ce_Join(ce_jops_t *ops, ce_appl_intf_t *c_appl) ;

7.1   Group operations

Similarly to the ML interface, the set of supported operations is: Leave, Cast, Send, Send1, Prompt, Suspect, XferDone, Rekey, ChangeProtocol, and ChangeProperties. Messages are arrays of IO-vectors (iovecs), or C memory chunks. The application can send and receive iovec-arrays.

Multicast an iovec-array to the group.

void ce_Cast(
    ce_appl_intf_t *c_appl,
    int num,
    ce_iovec_array_t iovl
) ;

Send a point-to-point message to a set of group members.

void ce_Send(
    ce_appl_intf_t *c_appl,
    int num_dests,
    ce_rank_array_t dests,
    int num,
    ce_iovec_array_t iovl
) ;

Send a point-to-point message to the specified group member.

void ce_Send1(
    ce_appl_intf_t *c_appl,
    ce_rank_t dest,
    int num,
    ce_iovec_array_t iovl
) ;

The control actions are the same as the ML actions.

Leave a group. Following this downcall, exit will be called, freeing the cappl.

void ce_Leave(ce_appl_intf_t *c_appl) ;

Ask for a new View.

void ce_Prompt(
    ce_appl_intf_t *c_appl
);

Report specified group members as failure-suspected.

void ce_Suspect(
    ce_appl_intf_t *c_appl,
    int num,
    ce_rank_array_t suspects
);

Inform Ensemble that the state-transfer is complete.

void ce_XferDone(
    ce_appl_intf_t *c_appl
) ;

Ask the system to rekey.

void ce_Rekey(
    ce_appl_intf_t *c_appl
) ;

Request a protocol change. The protocol_name is a string specifying the exact set of layers to use. The string is a colon separated list of layers, for example: Top:Heal:Switch:Leave:Inter:Intra:Elect:Merge:Sync:Suspect:Stable:Vsync:Frag_Abv:Top_appl:Frag:Pt2ptw:Mflow:Pt2pt:Mnak:Bottom

void ce_ChangeProtocol(
    ce_appl_intf_t *c_appl,
    char *protocol_name
) ;

Request a protocol change, specifying properties. properties is a string containing a colon separated list of properties. For example: "Gmp:Sync:Heal:Switch:Frag:Suspect:Flow:Xfer". The system deduces a protocol stack that abides by these properties.

void ce_ChangeProperties(
    ce_appl_intf_t *c_appl,
    char *properties
) ;
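Both ce_ChangeProtocol and ce_ChangeProperties take a colon separated string. A minimal sketch of assembling such a string from individual names follows; join_colon is a hypothetical helper, not part of the CE library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: join n names with ':' into a malloc'ed string.
 * Returns NULL if n <= 0 or if allocation fails. */
char *join_colon(const char *const *names, int n)
{
    size_t total = 0;
    char *out, *p;
    int i;

    if (n <= 0) return NULL;
    for (i = 0; i < n; i++)
        total += strlen(names[i]) + 1;   /* name plus ':' or final NUL */

    out = malloc(total);
    if (out == NULL) return NULL;

    p = out;
    for (i = 0; i < n; i++) {
        size_t len = strlen(names[i]);
        memcpy(p, names[i], len);
        p += len;
        *p++ = (i + 1 < n) ? ':' : '\0';
    }
    return out;
}
```

With the real library, the resulting string would be the properties (or protocol_name) argument. Whether the caller may free it afterwards depends on the memory convention described in the memory management section.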

7.2   Integration of other sockets into the main loop

Ensemble works in an event-driven fashion, where events come either from the network or from the user. The system runs a loop that alternates between (1) waiting for input on incoming sockets using a select system call, and (2) processing local application send/recv and internal events.

The application hands over control to Ensemble after initialization. The application may wish to wait on its own sockets, e.g., stdin (on Unix). To this end, we also support adding, removing, and putting handlers on sockets.

ce_handler_t is the type of handler called when there is input to process on a socket.

typedef void (*ce_handler_t)(void*);

ce_AddSockRecv adds a socket to the list Ensemble listens to. When input arrives on the socket, the handler is invoked with the specified environment value.

void ce_AddSockRecv(
    CE_SOCKET socket,
    ce_handler_t handler,
    ce_env_t env
);

ce_RmvSockRecv is called to remove a socket from the list Ensemble listens to.

void ce_RmvSockRecv(
    CE_SOCKET socket
);
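The following is a rough sketch of how a select-based loop can dispatch to registered handlers. It is not Ensemble's implementation: the demo_* names, the fixed-size table, and the use of plain int descriptors instead of CE_SOCKET are all assumptions made for illustration. Only the ce_handler_t shape is taken from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

typedef void (*ce_handler_t)(void*);   /* shape taken from the manual */

#define DEMO_MAX_SOCKS 8

typedef struct demo_sock_t {
    int          fd;
    ce_handler_t handler;
    void        *env;
} demo_sock_t;

demo_sock_t demo_socks[DEMO_MAX_SOCKS];
int demo_nsocks = 0;

/* Register a descriptor with its handler and environment. */
int demo_add_sock_recv(int fd, ce_handler_t handler, void *env)
{
    if (demo_nsocks == DEMO_MAX_SOCKS || fd < 0 || fd >= FD_SETSIZE)
        return -1;
    demo_socks[demo_nsocks].fd = fd;
    demo_socks[demo_nsocks].handler = handler;
    demo_socks[demo_nsocks].env = env;
    demo_nsocks++;
    return 0;
}

/* Remove a descriptor; the last entry fills the hole. */
int demo_rmv_sock_recv(int fd)
{
    int i;
    for (i = 0; i < demo_nsocks; i++) {
        if (demo_socks[i].fd == fd) {
            demo_socks[i] = demo_socks[--demo_nsocks];
            return 0;
        }
    }
    return -1;
}

/* One pass of the loop: wait up to timeout_ms, then call the handler
 * of every readable descriptor. Returns the number of handlers
 * invoked, or -1 if select fails. */
int demo_poll_once(int timeout_ms)
{
    fd_set rd;
    struct timeval tv;
    int i, maxfd = -1, called = 0;

    FD_ZERO(&rd);
    for (i = 0; i < demo_nsocks; i++) {
        FD_SET(demo_socks[i].fd, &rd);
        if (demo_socks[i].fd > maxfd) maxfd = demo_socks[i].fd;
    }
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    if (select(maxfd + 1, &rd, NULL, NULL, &tv) < 0) return -1;

    for (i = 0; i < demo_nsocks; i++) {
        if (FD_ISSET(demo_socks[i].fd, &rd)) {
            demo_socks[i].handler(demo_socks[i].env);
            called++;
        }
    }
    return called;
}

/* Example handler: consume one byte and count it. */
typedef struct demo_reader_t { int fd; int bytes_seen; } demo_reader_t;

void demo_on_readable(void *env)
{
    demo_reader_t *r = (demo_reader_t *)env;
    char c;
    if (read(r->fd, &c, 1) == 1) r->bytes_seen++;
}
```

The sketch assumes handlers do not add or remove descriptors while a pass is in progress; a production loop would have to handle that case.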

7.3   Memory management

The convention used throughout is that all data-structures passed from C to ML are consumed by ML, and all data-structures passed from ML to C are owned by the C side (hence must be freed). This rule holds for all structures and data apart from the iovec-arrays.

Ensemble does not copy messages from C to the ML heap; rather, it separates C memory and ML memory completely. Messages are received from the network and read directly into C buffers. Sent iovecs are fragmented and sent directly on the network. Messages must be buffered until all group members reliably receive them. To this end, a reference counting scheme is used to track iovec liveness. When an iovec's reference count reaches zero, it is freed. In other words, iovecs are owned by Ensemble. They are received either from the user or from the network.
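As a generic illustration of reference counting (not Ensemble's internal bookkeeping, which is not shown in this manual), the sketch below frees a buffer when its last holder releases it. The rc_buf_* names and the rc_live_buffers counter are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reference-counted buffer. */
typedef struct rc_buf_t {
    int    refs;   /* number of current holders */
    size_t len;
    char  *data;
} rc_buf_t;

int rc_live_buffers = 0;   /* buffers not yet freed, for illustration */

/* Create a buffer holding a copy of src, with one initial holder. */
rc_buf_t *rc_buf_create(const char *src, size_t len)
{
    rc_buf_t *b = malloc(sizeof *b);
    if (b == NULL) return NULL;
    b->data = malloc(len ? len : 1);
    if (b->data == NULL) { free(b); return NULL; }
    memcpy(b->data, src, len);
    b->len = len;
    b->refs = 1;
    rc_live_buffers++;
    return b;
}

/* A new holder (say, a retransmission queue) takes a reference. */
void rc_buf_retain(rc_buf_t *b)
{
    assert(b->refs > 0);
    b->refs++;
}

/* A holder is done; the last release frees the memory. */
void rc_buf_release(rc_buf_t *b)
{
    assert(b->refs > 0);
    if (--b->refs == 0) {
        free(b->data);
        free(b);
        rc_live_buffers--;
    }
}
```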

On Linux, the type of an iovec is:

typedef struct iovec ce_iovec_t ;
typedef ce_iovec_t *ce_iovec_array_t;
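A sketch of assembling a two-element iovec-array (a header followed by a body) from heap-allocated chunks is shown below. The two typedefs repeat the ones above so that the block compiles without ce.h; iov_fill, iov_make_pair, iov_total_len, and iov_free_pair are hypothetical helpers, and the use of malloc assumes the default allocator has not been replaced.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

typedef struct iovec ce_iovec_t;        /* as in the manual (Linux) */
typedef ce_iovec_t  *ce_iovec_array_t;

/* Hypothetical helper: copy len bytes into a fresh heap chunk and
 * describe that chunk with an iovec. Returns 0 on success, -1 on failure. */
int iov_fill(ce_iovec_t *iov, const void *src, size_t len)
{
    void *p = malloc(len ? len : 1);
    if (p == NULL) return -1;
    memcpy(p, src, len);
    iov->iov_base = p;
    iov->iov_len = len;
    return 0;
}

/* Build a two-element array: header first, then body. */
ce_iovec_array_t iov_make_pair(const void *hdr, size_t hlen,
                               const void *body, size_t blen)
{
    ce_iovec_array_t arr = malloc(2 * sizeof *arr);
    if (arr == NULL) return NULL;
    if (iov_fill(&arr[0], hdr, hlen) != 0) {
        free(arr);
        return NULL;
    }
    if (iov_fill(&arr[1], body, blen) != 0) {
        free(arr[0].iov_base);
        free(arr);
        return NULL;
    }
    return arr;
}

/* Total payload length across the array. */
size_t iov_total_len(const ce_iovec_t *arr, int num)
{
    size_t total = 0;
    int i;
    for (i = 0; i < num; i++) total += arr[i].iov_len;
    return total;
}

/* Only for arrays that were never handed to Ensemble. */
void iov_free_pair(ce_iovec_array_t arr)
{
    if (arr == NULL) return;
    free(arr[0].iov_base);
    free(arr[1].iov_base);
    free(arr);
}
```

Once such an array has been passed to a cast or send operation, the ownership rule of this section applies and the application should no longer free it.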

To get better control of the iovec memory system, the alloc and free functions can be set by the user. The definitions are in lib/mm.h.

These define the types of alloc and free functions.

typedef void* (*mm_alloc_t)(int);
typedef void  (*mm_free_t)(char*);

The actual functions called to allocate and free iovecs.

mm_alloc_t mm_alloc_fun;
mm_free_t mm_free_fun;

Use these functions to set alloc and free. Be careful to do this exactly once at application initialization, before starting Ensemble.

void set_alloc_fun(mm_alloc_t f);
void set_free_fun(mm_free_t f);
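As an illustration, a pair of counting wrappers around malloc and free that match the mm_alloc_t and mm_free_t shapes might look as follows. The two typedefs repeat the ones above so that the block compiles without lib/mm.h; counting_alloc, counting_free, and the demo_* and mm_demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef void* (*mm_alloc_t)(int);    /* shapes taken from the manual */
typedef void  (*mm_free_t)(char*);

long mm_demo_allocs = 0;
long mm_demo_frees  = 0;

/* Allocate through malloc and count successful allocations. */
void *counting_alloc(int size)
{
    void *p;
    assert(size >= 0);
    p = malloc(size > 0 ? (size_t)size : 1);
    if (p != NULL) mm_demo_allocs++;
    return p;
}

/* Free through free and count releases of non-NULL chunks. */
void counting_free(char *p)
{
    if (p != NULL) mm_demo_frees++;
    free(p);
}

/* Assigning to the manual's pointer types checks the signatures
 * at compile time. */
mm_alloc_t demo_alloc_fun = counting_alloc;
mm_free_t  demo_free_fun  = counting_free;

/* Chunks allocated but not yet freed. */
long mm_demo_outstanding(void)
{
    return mm_demo_allocs - mm_demo_frees;
}
```

With the real library, the wrappers would be registered once at start-up, before starting Ensemble, by calling set_alloc_fun(counting_alloc) and set_free_fun(counting_free).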

The upshot of this is that when a user sends or casts a message, Ensemble takes over the message body. When a message is delivered to the application, the user may copy it, or perform any read-only operation while in the receive callback. The application may not modify a received iovec, or assume it owns it.

7.4   The flat interface

Using iovecs is a little complex for simple applications; therefore, a simplified "flat" interface exists.

The flat_receive callbacks take a C memory chunk and its length as arguments. This relieves the application from merging the set of buffers that make up an iovec-array, as well as from releasing that array.

typedef void (*ce_appl_flat_receive_cast_t)(ce_env_t, ce_rank_t, ce_len_t, ce_data_t) ;

typedef void (*ce_appl_flat_receive_send_t)(ce_env_t, ce_rank_t, ce_len_t, ce_data_t) ;
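The merging step that the flat interface spares the application can be sketched as follows. iov_flatten is a hypothetical helper written for this illustration; it is not claimed to be how the library performs the merge.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical helper: merge num iovecs into one malloc'ed chunk.
 * Stores the total length in *out_len and appends a NUL byte for
 * convenience. Returns NULL on allocation failure. */
char *iov_flatten(const struct iovec *iovl, int num, size_t *out_len)
{
    size_t total = 0, off = 0;
    char *buf;
    int i;

    for (i = 0; i < num; i++) total += iovl[i].iov_len;

    buf = malloc(total + 1);
    if (buf == NULL) return NULL;

    for (i = 0; i < num; i++) {
        memcpy(buf + off, iovl[i].iov_base, iovl[i].iov_len);
        off += iovl[i].iov_len;
    }
    buf[total] = '\0';
    *out_len = total;
    return buf;
}
```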

Create a standard application interface using flat receive callbacks.

ce_appl_intf_t*
ce_create_flat_intf(
    ce_env_t env, 
    ce_appl_exit_t exit,
    ce_appl_install_t install,
    ce_appl_flow_block_t flow_block,
    ce_appl_block_t block,
    ce_appl_flat_receive_cast_t cast,
    ce_appl_flat_receive_send_t send,
    ce_appl_heartbeat_t heartbeat
);

Cast and Send operations that work with buffers instead of iovec-arrays.

void ce_flat_Cast(
    ce_appl_intf_t *c_appl,
    ce_len_t len, 
    ce_data_t buf
) ;

void ce_flat_Send(
    ce_appl_intf_t *c_appl,
    int num_dests,
    ce_rank_array_t dests,
    ce_len_t len, 
    ce_data_t buf
) ;

void ce_flat_Send1(
    ce_appl_intf_t *c_appl,
    ce_rank_t dest,
    ce_len_t len, 
    ce_data_t buf
) ;

7.5   An example

This section shows how to use the CE interface to write applications. We walk through the ce/ce_mtalk.c demo program.

ce/ce_mtalk.c, like demo/mtalk.ml, is a multi-person talk program. Messages are read from the user via stdin and multicast to the network.

state_t is the state structure used by the program. It is the environment variable registered in the C-interface. The state contains the current view information, a pointer to its cappl, and a flag indicating if we are blocked.

typedef struct state_t {
  ce_local_state_t *ls;
  ce_view_state_t *vs;
  ce_appl_intf_t *intf ;
  int blocked;
} state_t;

A helper function to multicast a message if we are not blocked. We use the flat interface to avoid the messy handling of iovecs.

void cast(state_t *s, char *msg){
  if (s->blocked == 0)
    ce_flat_Cast(s->intf, strlen(msg), msg);
}

A handler for stdin. This callback is called whenever there is input on the socket. The handler multicasts any message the user types on the screen. Be careful not to send messages if we are blocked.

void stdin_handler(void *env) {
  state_t *s = (state_t*)env;
  char buf[100], *tmp;
  int len ;
  
  if (fgets(buf, 100, stdin) == NULL)
    /* EOF or read error: nothing to send.
     */
    return;
  len = strlen(buf);
  if (len>=100)
    /* string too long, dumping it.
     */
    return;

  tmp = ce_copy_string(buf);
  TRACE2("Read %s:", tmp);
  cast(s, tmp);
}
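Note that fgets stores at most 99 characters in a 100-byte buffer, so a length test of this kind cannot detect an over-long line by itself. A sketch of one way to detect and discard over-long lines is given below; read_line is a hypothetical helper and is not part of ce_mtalk.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper. Reads one line from in into buf (capacity cap,
 * cap >= 2) and strips the newline.
 * Returns the line length, -1 on EOF or read error, or -2 if the line
 * did not fit, in which case the rest of that line is discarded. */
int read_line(FILE *in, char *buf, int cap)
{
    size_t len;
    int c;

    assert(cap >= 2);
    if (fgets(buf, cap, in) == NULL) return -1;

    len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';
        return (int)(len - 1);
    }

    /* No newline was stored: either the input ended, the line fits
     * exactly, or the line is too long. */
    c = getc(in);
    if (c == EOF || c == '\n') return (int)len;

    while (c != EOF && c != '\n') c = getc(in);
    return -2;
}
```

A handler built on this helper would return without casting anything when the result is negative.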

There is nothing special to do if we leave the group; the application essentially halts.

void main_exit(void *env) {}

When a new view arrives, update the environment structure. Do not forget to free the old view structure.

void main_install(void *env, ce_local_state_t *ls, ce_view_state_t *vs) {
  state_t *s = (state_t*) env;

  ce_view_full_free(s->ls,s->vs);
  s->ls = ls;
  s->vs = vs;
  s->blocked =0;
  printf("%s nmembers=%d", ls->endpt, ls->nmembers);
}

Ignore flow control problems. We are not supposed to have any of these, since we use very little bandwidth.

void main_flow_block(void *env, ce_rank_t rank, ce_bool_t onoff) {}

Mark our blocked flag.

void main_block(void *env) {
  state_t *s = (state_t*) env;

  s->blocked=1;
}

Print out any message that we receive. Be careful not to free the received message.

void main_recv_cast(void *env, int rank, ce_len_t len, char *msg) {
  /* The sender casts strlen(msg) bytes, without the terminating NUL,
   * so bound the print by len.
   */
  printf("recv_cast <- %d msg=%.*s", rank, (int)len, msg);
}

Ignore send messages, we are not supposed to get any of these.

void main_recv_send(void *env, int rank, ce_len_t len, char *msg) {
}

Ignore heartbeats.

void main_heartbeat(void *env, double time) { }

Create a join options structure, and join the group "ce_mtalk". Use a regular virtually-synchronous stack. Put a handler on stdin such that whenever there is input, it will be called.

There is no need to set the transport in the join-options structure; the system uses the environment variable ENS_MODES in this case.

void join() {
  ce_jops_t *jops; 
  ce_appl_intf_t *main_intf;
  state_t *s;
  
  /* The rest of the fields should be zero. The
   * conversion code should be able to handle this. 
   */
  jops = record_create(ce_jops_t*, jops);
  record_clear(jops);
  jops->hrtbt_rate=10.0;
  //  jops->transports = ce_copy_string("UDP");
  jops->group_name = ce_copy_string("ce_mtalk");
  jops->properties = ce_copy_string(CE_DEFAULT_PROPERTIES);
  jops->use_properties = 1;

  s = (state_t*) record_create(state_t*, s);
  record_clear(s);
    
  main_intf = ce_create_flat_intf(s,
                main_exit, main_install, main_flow_block,
                main_block, main_recv_cast, main_recv_send,
                main_heartbeat);
  
  s->intf= main_intf;
  /* jops is already a pointer; pass it as is. */
  ce_Join (jops, main_intf);
  
  ce_AddSockRecv(0, stdin_handler, s);
}

The main entry point: initialize the ML side, process command line arguments, join the ce_mtalk group, and turn control over to the Ensemble event loop.

int main(int argc, char **argv) {
  
  ce_Init(argc, argv); /* Call Arge.parse, and appl_process_args */

  join();
  
  ce_Main_loop ();
  return 0;
}

7.6   Outboard mode

It is possible to run any CE application through a remote Ensemble server. Such a configuration is called an "outboard" configuration. The idea is to run a daemon on the local host that listens for TCP connections on a specific port and provides Ensemble services to connected clients. Such services include joining/leaving groups, and sending/receiving multicast and point-to-point messages on these groups.

A CE application can be configured to run in outboard mode by linking with the libceo library (suffix .a on Unix, .lib on WIN32). The user must then make sure that the Ensemble daemon is running; to start it, simply run the ce_outboard executable.

Using a daemon configuration has several benefits as well as some drawbacks. The advantages include simple client code and a minimal-sized library. The disadvantage is performance loss: each message now has to travel through a socket and another process before being sent on the network, and vice versa for received messages. This loss may outweigh those benefits.

The current port used by the outboard mode is 5002. This is configurable by running ce_outboard with the command line argument -tcp_port <port_num>, and modifying the OUTBOARD_TCP_PORT parameter in ce/ce_outboard_comm.h.

Care was taken to optimize memory consumption. Messages are sent zero-copy from the client, and they are copied only once into the server's buffers. A sent io-vector is consumed by the send function. Received messages are allocated in the client's buffers and handed to the application. After the application's receive callback returns, io-vectors are released. It would have been possible at this point to let the application take control of the io-vector, yet we chose to conform to the memory conventions of the inboard mode.

7.7   Thread-safety

A thread-safe version of the library is also provided; it exports exactly the same interface as the basic library. To use it, link with libce_mt.so or libceo_mt.so; on WIN32 systems, link with the corresponding .lib files instead. The thread-safe library requires the application to synchronize its threads so that they do not perform actions (send, cast, prompt, etc.) on a group while it is stabilizing. There are several thread-safe applications under the ce directory: ce_rand_mt.c, ce_perf_mt.c, and ce_mtalk_mt.c. These applications use a lock to ensure that sensitive group-state is accessed safely. Threads atomically check group-state before performing an Ensemble action.

The thread-safe library is designed as a wrapper around the basic library. A single thread runs both the Ensemble main-loop and the application callback handlers; this thread is known as the Ensemble thread. Other threads are referred to as user-threads. When a user-thread performs an action outside of a handler, the action is stored in a pending queue. A byte is sent through a socket to the Ensemble thread, notifying it that there is pending work to do. Asynchronously, the Ensemble thread "wakes up", consumes the queue, and performs all pending actions. Any actions invoked in the interim are also stored in the pending queue, to be consumed along with the rest.
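The pending-queue mechanism can be sketched roughly as follows. This is an illustration only, not the library's code: the pq_* names are hypothetical, a pipe stands in for the notification socket, and the queue has a fixed capacity.

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

#define PQ_MAX 64

typedef void (*pq_action_t)(void *arg);
typedef struct pq_item_t { pq_action_t fn; void *arg; } pq_item_t;

typedef struct pq_t {
    pthread_mutex_t lock;
    pq_item_t items[PQ_MAX];
    int count;
    int wake_rd;   /* watched by the main-loop thread */
    int wake_wr;   /* written by user threads */
} pq_t;

int pq_init(pq_t *q)
{
    int fds[2];
    if (pipe(fds) != 0) return -1;
    /* Non-blocking read end, so draining never stalls the main loop. */
    if (fcntl(fds[0], F_SETFL, O_NONBLOCK) != 0 ||
        pthread_mutex_init(&q->lock, NULL) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    q->count = 0;
    q->wake_rd = fds[0];
    q->wake_wr = fds[1];
    return 0;
}

/* User thread: store the action, then send one wake-up byte. */
int pq_post(pq_t *q, pq_action_t fn, void *arg)
{
    int queued = 0;
    assert(fn != NULL);
    pthread_mutex_lock(&q->lock);
    if (q->count < PQ_MAX) {
        q->items[q->count].fn = fn;
        q->items[q->count].arg = arg;
        q->count++;
        queued = 1;
    }
    pthread_mutex_unlock(&q->lock);
    if (!queued) return -1;
    return write(q->wake_wr, "x", 1) == 1 ? 0 : -1;
}

/* Main-loop thread: swallow wake-up bytes, take the whole batch under
 * the lock, then run the actions outside the lock.
 * Returns the number of actions performed. */
int pq_drain(pq_t *q)
{
    pq_item_t batch[PQ_MAX];
    char wake[PQ_MAX];
    int n, i;
    ssize_t got = read(q->wake_rd, wake, sizeof wake);
    (void)got;

    pthread_mutex_lock(&q->lock);
    n = q->count;
    for (i = 0; i < n; i++) batch[i] = q->items[i];
    q->count = 0;
    pthread_mutex_unlock(&q->lock);

    for (i = 0; i < n; i++) batch[i].fn(batch[i].arg);
    return n;
}

/* Example action: increment an int counter. */
void pq_demo_bump(void *arg)
{
    (*(int *)arg)++;
}
```

Running the actions after releasing the lock lets an action post further work without deadlocking; such work is picked up on the next drain.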

Any action invoked from within a callback is performed directly when the callback is completed and control returns to Ensemble.

Since a single thread performs the Ensemble main-loop as well as all user callbacks, callbacks must be short. Long-term computations should not be performed in the context of a callback.

There are three sensitive periods in which issuing Ensemble actions is not allowed: joining, leaving, and blocking. A group is in:

joining state: between ce_Join and the first install callback.
leaving state: between ce_Leave and the exit callback.
blocking state: between the block callback and the succeeding install callback.

An example of a simple multi-threaded application is provided in ce/ce_mtalk_mt.c.
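The three sensitive periods can be tracked with a small state machine. The sketch below is a hypothetical illustration; the group_status_t type and the gs_* functions are not part of the CE interface.

```c
#include <assert.h>

/* Hypothetical status of one group, as seen by the application. */
typedef enum {
    GS_JOINING,   /* between ce_Join and the first install callback */
    GS_NORMAL,    /* actions are allowed */
    GS_BLOCKED,   /* between block and the succeeding install callback */
    GS_LEAVING,   /* between ce_Leave and the exit callback */
    GS_EXITED
} group_status_t;

/* Called right before ce_Join. */
group_status_t gs_on_join(void)
{
    return GS_JOINING;
}

/* Called from the install callback. */
group_status_t gs_on_install(group_status_t st)
{
    /* A view may still arrive after ce_Leave; stay in leaving state. */
    return (st == GS_LEAVING || st == GS_EXITED) ? st : GS_NORMAL;
}

/* Called from the block callback. */
group_status_t gs_on_block(group_status_t st)
{
    return (st == GS_LEAVING || st == GS_EXITED) ? st : GS_BLOCKED;
}

/* Called right before ce_Leave. */
group_status_t gs_on_leave(group_status_t st)
{
    return (st == GS_EXITED) ? st : GS_LEAVING;
}

/* Called from the exit callback. */
group_status_t gs_on_exit(void)
{
    return GS_EXITED;
}

/* Actions such as cast or send are allowed only in the normal state. */
int gs_can_act(group_status_t st)
{
    return st == GS_NORMAL;
}
```

In a multi-threaded program the status would be read and updated under a lock, so that a user-thread checks it atomically before issuing an action.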

The overhead of adding thread-safety is 10% in the worst case, and normally much less than that. This should be acceptable for most applications.

7.8   A multi-threaded multi-person chat program

This program is a multi-threaded version of ce_mtalk.c. Here, we walk through it and explain the interface and how to use it.

Include the system-independent thread header file, so we'll be able to use locks.

#include "ce_trace.h"
#include "ce.h"
#include "ce_threads.h"
#include <stdio.h>
#include <memory.h>
#include <malloc.h>

The NAME variable is used for internal tracing purposes of CE. There is no need to set it for standard user programs.

#define NAME "CE_MTALK_MT"

Apart from the standard view state, the state structure keeps track of the current status of the group: blocked, joining, or leaving.

typedef struct state_t {
    ce_local_state_t *ls;
    ce_view_state_t *vs;
    ce_appl_intf_t *intf ;
    int blocked;
    int joining;
    int leaving;
    ce_lck_t *mutex;
} state_t;

Although we must define these callbacks, they do nothing in this program.

void main_exit(void *env)
{}

void
main_flow_block(void *env, ce_rank_t rank, ce_bool_t onoff)
{}

void
main_recv_send(void *env, int rank, ce_len_t len, char *msg)
{}

void
main_heartbeat(void *env, double time)
{}

main_install updates the view state. A lock must be taken to protect view state, as other threads may concurrently read the state.

void
main_install(void *env, ce_local_state_t *ls, ce_view_state_t *vs)
{
    state_t *s = (state_t*) env;
    
    ce_lck_Lock(s->mutex); {
        ce_view_full_free(s->ls,s->vs);
        s->ls = ls;
        s->vs = vs;
        s->blocked =0;
        s->joining =0;
	
        printf("%s nmembers=%d", ls->endpt, ls->nmembers);
        TRACE2("main_install",s->ls->endpt); 
    } ce_lck_Unlock(s->mutex);
}

The group is blocked, lock the state structure, and update the blocked flag. This notifies other threads not to attempt sending messages until the upcoming install callback. A lock must be taken to protect view state, as other threads may read it.

void
main_block(void *env)
{
    state_t *s = (state_t*) env;
    
    ce_lck_Lock(s->mutex); {
        s->blocked=1;
    } ce_lck_Unlock(s->mutex);
}

Received a message, print who sent it and its content.

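void
main_recv_cast(void *env, int rank, ce_len_t len, char *msg)
{
    /* The sender casts strlen(msg) bytes, without the terminating NUL,
     * so bound the print by len.
     */
    printf("%d -> msg=%.*s", rank, (int)len, msg); fflush(stdout);
}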

get_input is a non-terminating function performed by the user-thread of this program. In an infinite loop, read a line from stdin, and multicast it to the group. Prior to sending, check that the group is not blocked/joining/leaving. Status flags are shared information, and may be updated concurrently by an install or block callback. Hence, a lock is taken to protect access to the flags.

void
get_input(void *env)
{
    state_t *s = (state_t*)env;
    char buf[100], *msg;
    int len ;

    while (1) {
        TRACE("stdin_handler");
        if (fgets(buf, 100, stdin) == NULL)
            /* EOF or read error: stop reading input.
             */
            return;
        len = strlen(buf);
        if (len>=100)
            /* string too long, dumping it.
             */
            return;
        	
        msg = ce_copy_string(buf);
        TRACE2("Read: ", msg);
	
        ce_lck_Lock(s->mutex); {
            if (s->joining || s->leaving || s->blocked)
                printf("Cannot send while group is joining/leaving/blocked");
            else {
                ce_flat_Cast(s->intf, strlen(msg), msg);
            }
        } ce_lck_Unlock(s->mutex);
    }
}

Initialize the state structure, and join the "ce_mtalk" Ensemble group. Take care to initialize the lock, and set the joining flag. The flag will be unset, allowing sending messages, in the first install callback.

state_t *
join(void)
{
    ce_jops_t *jops; 
    ce_appl_intf_t *main_intf;
    state_t *s;
    
    /* The rest of the fields should be zero. The
     * conversion code should be able to handle this. 
     */
    jops = record_create(ce_jops_t*, jops);
    record_clear(jops);
    jops->hrtbt_rate=10.0;
    jops->transports = ce_copy_string("DEERING");
    jops->group_name = ce_copy_string("ce_mtalk");
    jops->properties = ce_copy_string(CE_DEFAULT_PROPERTIES);
    jops->use_properties = 1;
    
    s = (state_t*) record_create(state_t*, s);
    record_clear(s);
    
    main_intf = ce_create_flat_intf(s,
                main_exit, main_install, main_flow_block,
                main_block, main_recv_cast, main_recv_send,
                main_heartbeat);

    s->intf= main_intf;
    s->mutex = ce_lck_Create();
    s->joining = 1;
    ce_Join (jops, main_intf);
    return s;
}

Initialize Ensemble, join the group, start the reader thread, and hand control to the Ensemble main loop.

int
main(int argc, char **argv)
{
    state_t *s;
    
    ce_Init(argc, argv); /* Call Arge.parse, and appl_process_args */

    /* Join the group
     */
    s = join();
    
    /* Create a thread to read input from the user.
     */
    ce_thread_Create(get_input, s, 10000);
    
    ce_Main_loop ();
    return 0;
}


7.9   Notes

Of the four transports supported by Ensemble (NETSIM, UDP, TCP, and DEERING), NETSIM is not supported by the thread-safe library. The library uses a socket internally, and NETSIM does not allow any external communication.
