Creating Reliable Distributed Systems in Java

March 9, 1999

Distributed object applications written in Java are typically implemented with Java Remote Method Invocation (RMI). An RMI application usually consists of two separate programs: a server and a client. The server creates remote objects, makes references to them accessible, and waits for clients to invoke methods on them. The client obtains a remote reference to one or more remote objects on the server and then invokes methods on them. RMI provides the mechanism by which the server and the client communicate and pass information back and forth. RMI eases the development of distributed applications, but it does not by itself improve their reliability: a single failure of the server causes the whole application to fail.

Building reliable distributed object applications is an interesting research area, but little work has been done on fault-tolerance techniques for RMI. Two research groups are working in this area: one from the Computer Science Department at New York University together with Lucent Technologies Bell Laboratories, and another from the Computer Science Department at UC Berkeley.

The NYU-Lucent group built replicated, fault-tolerant Java server objects using a Java package called Filterfresh. They formed a logical group by instantiating a GroupManager object with each replica, in order to maintain the correctness and integrity of the replicated servers. A group membership algorithm maintained a consistent group view and a reliable multicast mechanism. The implementation of Filterfresh took two steps. First, the GroupManager class was used to construct a fault-tolerant RMI registry. Second, a transparent client-side mechanism was built that allows application server failures to be tolerated at the stub layer.

The Berkeley group built a carefully controlled environment out of a collection of execution environments that are receptive to mobile code, and used dynamically generated code to introduce run-time-generated levels of indirection that separate clients from services. They tried to address all of the difficult service fault-tolerance, availability, and consistency problems within this controlled environment. They built a Java-based prototype implementation of a Base called MultiSpace and implemented two novel services. They demonstrated that applications built on their implementation of MultiSpace were fault-tolerant and scalable.

Both of the above approaches can improve the reliability of Sun's RMI service. However, both require building a complicated Java package and are therefore not completely compatible with standard Java RMI. Supporting existing RMI applications under these approaches requires considerable changes to the Java source code. It would be preferable to reuse existing RMI applications as much as possible and add fault-tolerance functionality to them. This is the major advantage of my approach.

My goal is to design a solution that extends Sun's Java RMI with group communication to form a basis for highly available (fault-tolerant) client-server applications. More specifically, instead of making a remote method call to a single remote object (an RMI server), a client transparently makes the call to a server group containing a set of replicas of the RMI server, and receives the first response returned by those remote objects. In this way, when one or more servers are temporarily down, the client can still obtain a response from the remaining healthy servers without knowing where the failures occurred. Reliable Java client-server applications can therefore be built.
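The first-response idea can be illustrated with a small in-process sketch. It is not the proposed implementation: it assumes a modern JDK (lambdas and java.util.concurrent, which postdate JDK 1.2), the replicas are plain local objects rather than exported RMI servers, and the names PingService and FirstResponseDemo are hypothetical.

```java
import java.rmi.RemoteException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical service interface; it stands in for an RMI remote interface.
interface PingService {
    String ping() throws RemoteException;
}

class FirstResponseDemo {
    /** Calls every replica concurrently and returns the first successful reply. */
    static String firstResponse(List<PingService> replicas) throws RemoteException {
        if (replicas.isEmpty()) {
            throw new RemoteException("server group is empty");
        }
        ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
        try {
            List<Callable<String>> calls = new ArrayList<>();
            for (PingService replica : replicas) {
                calls.add(replica::ping);
            }
            // invokeAny returns the result of one call that completed successfully
            // and throws ExecutionException only if every call failed.
            return pool.invokeAny(calls);
        } catch (ExecutionException e) {
            throw new RemoteException("all replicas failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RemoteException("interrupted while waiting for a replica", e);
        } finally {
            pool.shutdownNow(); // cancel calls that are still outstanding
        }
    }

    public static void main(String[] args) throws RemoteException {
        PingService down = () -> { throw new RemoteException("server down"); };
        PingService up = () -> "pong from a healthy replica";
        System.out.println(firstResponse(Arrays.asList(down, up)));
    }
}
```

The caller sees a failure only when every replica in the group fails, which is the availability property the design aims for.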

In order to implement the above methodology, the way remote objects are passed has to be designed and specified. Generally, remote objects are passed by reference. A remote object reference is a stub: a client-side proxy that implements the complete set of remote interfaces that the remote object implements. The stub code runs on the client and converts client requests for service functionality into network messages, marshalling request parameters and unmarshalling results as needed. A normal Java stub is generated by Sun's RMI compiler, rmic. But a stub of this kind is bound to a single server, which can become a performance bottleneck and a single point of failure. One solution is to use a Dispatch Stub, which is needed to implement a multiple-server (clustered) architecture.
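The role of a stub described above can be made concrete with a hand-written, in-process sketch. It is not the code that rmic generates and it does not use RMI's wire protocol: requests are marshalled with plain Java serialization, the "network" is an ordinary function, and Greeter, GreeterStub, GreeterSkeleton and Wire are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.function.Function;

// Hypothetical service interface shared by client and server.
interface Greeter {
    String greet(String name);
}

// Helper that turns objects into bytes and back (marshalling / unmarshalling).
class Wire {
    static byte[] marshal(Object value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Object unmarshal(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}

// Client-side proxy: implements the same interface as the remote object.
class GreeterStub implements Greeter {
    private final Function<byte[], byte[]> transport; // stands in for the network

    GreeterStub(Function<byte[], byte[]> transport) {
        this.transport = transport;
    }

    @Override
    public String greet(String name) {
        byte[] request = Wire.marshal(new Object[] { "greet", name });
        byte[] reply = transport.apply(request);
        return (String) Wire.unmarshal(reply);
    }
}

// Server-side counterpart: decodes the request and calls the real object.
class GreeterSkeleton {
    static byte[] handle(Greeter target, byte[] request) {
        Object[] message = (Object[]) Wire.unmarshal(request);
        if (!"greet".equals(message[0])) {
            throw new IllegalArgumentException("unknown method: " + message[0]);
        }
        return Wire.marshal(target.greet((String) message[1]));
    }
}

class StubDemo {
    public static void main(String[] args) {
        Greeter server = name -> "Hello, " + name;
        Greeter stub = new GreeterStub(request -> GreeterSkeleton.handle(server, request));
        System.out.println(stub.greet("RMI"));
    }
}
```

Because the client only ever holds a Greeter, the object behind that interface can be replaced by a smarter stub without changing client code.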

The base service of the Dispatch Stub employs a technique similar to Java RMI, except that the stub code contains embedded logic to select among a cluster of servers. Load balancing is thus implemented within the Dispatch Stub. Failover is similarly accomplished by reissuing failed or timed-out service calls to an alternative replica server. The client application can be coded without knowledge of the Dispatch Stub logic: it can simply assume that an invocation of the service through the Dispatch Stub either completes or returns an error if no back-end node can complete the request. The Dispatch Stub will be generated by modifying Sun's standard RMI compiler, which involves modifying source code from JDK 1.2. There are two ways to do this: create a new RMI compiler, or add a new option to Sun's rmic. The latter approach will be used in this project in order to maintain compatibility.
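The dispatch logic described above can be sketched with a dynamic proxy. This is only an illustration under stated assumptions: the project proposes generating the stub with a modified rmic, whereas this sketch uses java.lang.reflect.Proxy (not available in JDK 1.2), picks a starting replica round-robin as one possible load-balancing policy, and treats only RemoteException as a replica failure; time-outs are not modelled. EchoService and the class layout are hypothetical.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.rmi.RemoteException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical remote interface; methods declare RemoteException as in RMI.
interface EchoService {
    String echo(String text) throws RemoteException;
}

class DispatchStub implements InvocationHandler {
    private final List<?> replicas;
    private final AtomicInteger cursor = new AtomicInteger();

    DispatchStub(List<?> replicas) {
        if (replicas.isEmpty()) {
            throw new IllegalArgumentException("at least one replica is required");
        }
        this.replicas = replicas;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        // Load balancing: each call starts at the next replica in round-robin order.
        int start = Math.floorMod(cursor.getAndIncrement(), replicas.size());
        RemoteException last = null;
        for (int i = 0; i < replicas.size(); i++) {
            Object replica = replicas.get((start + i) % replicas.size());
            try {
                return method.invoke(replica, args);
            } catch (InvocationTargetException e) {
                Throwable cause = e.getCause();
                if (!(cause instanceof RemoteException)) {
                    throw cause; // application-level error: do not retry
                }
                last = (RemoteException) cause; // failover: reissue on the next replica
            }
        }
        throw new RemoteException("no replica could complete " + method.getName(), last);
    }

    /** Wraps a group of replicas behind a single object of the remote interface type. */
    static <T> T create(Class<T> remoteInterface, List<? extends T> replicas) {
        Object stub = Proxy.newProxyInstance(remoteInterface.getClassLoader(),
                new Class<?>[] { remoteInterface }, new DispatchStub(replicas));
        return remoteInterface.cast(stub);
    }
}

class DispatchStubDemo {
    public static void main(String[] args) throws RemoteException {
        EchoService dead = text -> { throw new RemoteException("connection refused"); };
        EchoService live = text -> "echo: " + text;
        EchoService stub = DispatchStub.create(EchoService.class, Arrays.asList(dead, live));
        System.out.println(stub.echo("hello")); // served by the live replica
    }
}
```

The client code in main uses only the EchoService interface, so it does not need to know that a group of replicas sits behind the stub.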

The reliability of Java distributed applications can be tested by implementing this methodology. The testing procedure has four steps. First, pick any available RMI application written in Java, such as the WhiteBoard application from CMU, and run it without modification. This system is not reliable, because a single failure of the server causes the application to fail. Second, extend the application with clustered servers, wrapping them into a server group associated with the concept of a “channel” presented in the JavaGroups package. Third, use the new option in the RMI compiler to generate the Dispatch Stub. Last, take down one or more servers to see whether the application is still available. The application should continue to work unless all servers are taken down.
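The last step of the procedure can be rehearsed with a small fault-injection simulation. This sketch is an assumption-laden stand-in for the real test: the servers are local objects with an on/off flag rather than RMI processes, the group call is a simple sequential failover rather than a generated Dispatch Stub, and ReplicaServer and AvailabilityCheck are hypothetical names.

```java
import java.rmi.RemoteException;
import java.util.ArrayList;
import java.util.List;

// Simulated server replica that can be switched off to inject a failure.
class ReplicaServer {
    private final String name;
    private volatile boolean up = true;

    ReplicaServer(String name) {
        this.name = name;
    }

    void takeDown() {
        up = false;
    }

    String draw(String shape) throws RemoteException {
        if (!up) {
            throw new RemoteException(name + " is down");
        }
        return name + " drew " + shape;
    }
}

class AvailabilityCheck {
    /** Tries each server in turn and returns the first successful reply. */
    static String callGroup(List<ReplicaServer> group, String shape) throws RemoteException {
        RemoteException last = null;
        for (ReplicaServer server : group) {
            try {
                return server.draw(shape);
            } catch (RemoteException e) {
                last = e; // this replica failed: try the next one
            }
        }
        throw new RemoteException("all servers are down", last);
    }

    /** Takes servers down one at a time and counts the shutdowns the group survives. */
    static int survivedFailures(List<ReplicaServer> group) {
        int survived = 0;
        for (ReplicaServer victim : group) {
            victim.takeDown();
            try {
                callGroup(group, "circle");
                survived++;
            } catch (RemoteException e) {
                break; // the application is no longer available
            }
        }
        return survived;
    }

    public static void main(String[] args) {
        List<ReplicaServer> group = new ArrayList<>();
        group.add(new ReplicaServer("A"));
        group.add(new ReplicaServer("B"));
        group.add(new ReplicaServer("C"));
        System.out.println("survived failures: " + survivedFailures(group));
    }
}
```

With n replicas in the group, the simulated application stays available through n - 1 shutdowns and fails only when the last server is taken down.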

Appendix: Main References

Arash Baratloo, P. E. Chung, and Yennun Huang "Filterfresh: Hot Replication of Java RMI Server Objects" Dept of Computer Science, New York University

A. Wollrath, R. Riggs and J. Waldo. "A Distributed Object Model for the Java System". USENIX Journal, Fall 1996

Bela Ban "JavaGroups - Group Communication Patterns in Java" http://www.cs.cornell.edu/home/bba/

N. Brown, C. Kindel. "Distributed Component Object Model Protocol - DCOM/1.0." Internet Draft, 1996

Object Management Group. "The Common Object Request Broker: Architecture and Specification 2.1", 1997

Steven D. Gribble, Matt Welsh, Eric A. Brewer, and David Culler "The MultiSpace: an Evolutionary Platform for Infrastructural Services". http://ninja.cs.berkeley.edu/dist/slides/ninja_retreat_jan_99

S. Maffeis. "IBus - The Java Intranet Software Bus". http://www.softwired.ch/ibus.htm

Sun Microsystems. "Java Remote Method Invocation - Distributed Computing for Java". http://java.sun.com/

The Object Management Group (OMG). "The Common Object Request Broker: Architecture and Specification", February 1998. http://www.omg.org/library/c2index.html

W. Jia. "Implementation of a Reliable Multicast Protocol". Software: Practice and Experience, July 1997