Status - Q4 1998

This report summarizes progress on Cornell's Ensemble Project. It covers the fourth quarter of calendar year 1998 and assumes a broad familiarity with our work. During Q4 1998 we made rapid progress in several areas of our effort.

In what follows, we highlight some of the more interesting developments.

  1. We continued work on our VIA-based protocol suite for managing scalable cluster-style computers. This work builds on a multiyear effort to explore applications of group communication and group membership management in the context of cluster computing. In the past, we worked by adapting the Ensemble protocols and specializing them for the cluster. Our recent effort considers the direct implementation of mechanisms for cluster computing over VIA, the new Intel-developed standard within the SAN consortium of cluster computing providers.
  2. With respect to our more stable Ensemble technology, one major goal is to pursue standards. With this in mind, we developed a proposal for an interface standard for reliable multicast technologies with varying properties. The idea is to use a single standard interface under which a reliable multicast implementation might support various semantics, much as Ensemble and Horus use a standard interface for each protocol stack irrespective of the particular microprotocols that comprise the stack. One difference is that our proposal does not assume that the underlying technology is stack-based: stacks of microprotocols have costs which can be significant and must be compiled out using a variety of optimizations. Our interface is oriented towards a world in which such a stack might be optimized offline into a single module which would be plugged into the communication subsystem. The approach seems flexible enough to cover the various protocols used in Ensemble, the group keying work just mentioned, Scalable Reliable Multicast (SRM), the Reliable Multicast Transport Protocol (RMTP), and others. We have circulated this proposal to several vendors (including Microsoft and Sun's Java/RMI group) and to the OMG (which is studying this general issue for a future CORBA standards effort), and we will present it to the Quorum group studying such standards. Farnam Jahanian already has a copy. Our hope is that by promoting standards we can help advance the transition of multicast and reliable multicast into a widely accepted commercial technology base.
  3. We continued our work on an Internet implementation of our new Bimodal Multicast technology and the associated probabilistic membership tracking technology. As you may recall from our Q3 progress report, the community that works on reliable multicast splits into two camps: researchers who build critical systems have taken reliability to mean atomic broadcast or virtual synchrony, but these protocols have not scaled well in real settings. Researchers who take reliability to mean scalability with "best-effort" reliability have created relatively scalable protocols, but these give unstable throughput when failures do occur and cannot be used in many critical settings because they lack end-to-end guarantees. Starting in late 1997, we pioneered a new approach that combines the benefits of these two alternatives.
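The single-interface idea in item 2 can be made concrete with a minimal Python sketch. The class and method names below are purely illustrative, not the proposal's actual API; the point is that implementations with very different semantics can sit behind one interface.

```python
from abc import ABC, abstractmethod

class ReliableMulticast(ABC):
    """Hypothetical uniform interface: the delivery semantics (virtual
    synchrony, SRM-style best effort, RMTP trees, ...) live behind it,
    possibly as a stack compiled offline into one module."""
    @abstractmethod
    def join(self, group): ...
    @abstractmethod
    def send(self, group, msg): ...
    @abstractmethod
    def set_deliver(self, callback): ...

class LoopbackMulticast(ReliableMulticast):
    """Trivial single-process stand-in, used only to exercise the
    interface shape; it delivers every message back to the caller."""
    def __init__(self):
        self.members = {}
        self.deliver = None
    def join(self, group):
        self.members.setdefault(group, set()).add("self")
    def send(self, group, msg):
        if self.deliver is not None:
            self.deliver(group, msg)
    def set_deliver(self, callback):
        self.deliver = callback
```

An application written against `ReliableMulticast` could then be re-linked against an SRM-style or virtually synchronous implementation without source changes, which is the interoperability the proposal is after.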

During Q4 of 1998, our work on bimodal multicast and probabilistic membership agreement has continued to expand. We used this work as the basis of a new DARPA proposal; if funded, this would be an example of a situation in which Intel's funding to Cornell helped serve as a bridge to enable more substantial funding from DARPA. Some benefits of this new work include the following: 

Innovative uses of IP multicast and router-level information. We see considerable potential for exploiting IP multicast more effectively in our protocols, with the result that latencies could be reduced and message loss overcome more cleanly. Right now, our emphasis is on the development of an Internet version of these two protocols; both currently exist in experimental forms, and neither is ready for production use. Going forward, we envision making these protocols increasingly adaptive, integrating them with security mechanisms, looking at the various IP multicast routing alternatives which have been proposed in the literature relative to the needs of this class of uses, and exploring options for using information known to the routers within the protocol. For example, these protocols might exploit the multicast routing tree in selecting optimal gossip patterns, or might use predicted message latencies in selecting an optimized message diffusion method.
Integration of these protocols into more traditional software engineering tools. We are exploring the use of probabilistic multicast in support of distributed shared memory, distributed versions of MIBs, DNS or X.500 services, and other conventional distributed computing tools. The question here boils down to one of offering the operating system user an interface that will look familiar and comfortable, and yet that reflects the strong properties associated with the underlying protocols. Currently, these sorts of tools tend to have ad-hoc protocols and each has its own properties, if they can be said to have properties at all. Perhaps it is not unreasonable to imagine a world of standardization around probabilistic properties for tools offering forms of distributed state. Unlike past proposals of this sort, which have carried the baggage of relatively complex, synchronous, application-layer protocols like virtual synchrony, the probabilistic protocols are simple, lightweight, and loosely coupled (asynchronous). Perhaps, given sufficiently natural embeddings, they can overcome the resistance that met previous generations of reliable communication technology.
Scalability analysis of probabilistic protocols. We are using experimentation and simulation to evaluate our protocols and to compare them with well known alternatives such as SRM and RMTP. This work confirms that our pbcast protocol scales with much lower overheads than SRM, for example, and that pbcast throughput is extremely stable, while SRM and RMTP throughput can be quite bursty when the network experiences noise.
Development of methodologies for reasoning with probabilistic uncertainty. Our new tools create a world in which data can be replicated probabilistically. How should algorithms that use such data operate? Clearly, we want them to be able to make local decisions; hence, the primary choice is between making a decision now and waiting until the quality of the data improves. (The latter makes sense because, with these protocols, the quality of past information tends to improve as time elapses.) How should such a decision be made? Is there a theory of reasoning with uncertain information that can be used to inform the design of such algorithms? We are collaborating with Joe Halpern and Danny Dolev on some of these topics.
Extension of Ensemble to deal with large groups. This work is partly a side effect of the other topics we are pursuing. We are developing our probabilistic protocols using Ensemble, but find that Ensemble is unable to scale to more than about 100 group members at a time; at this size, groups spontaneously partition and remerge for reasons not yet clear to us. We are therefore extending Ensemble to maintain virtually synchronous groups with thousands of members, so that we can employ Ensemble tools as part of our research on probabilistic protocols.
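The probabilistic dissemination style discussed above can be illustrated with a toy push-gossip simulation. This is a sketch of the general epidemic mechanism only, not the pbcast implementation: one sender starts with a message, and every process holding it forwards it to a few peers chosen at random in each round.

```python
import random

def gossip_rounds(n, fanout, rounds, seed=0):
    """Toy push-gossip model: each process holding the message forwards
    it to `fanout` peers chosen uniformly at random in every round.
    Returns how many of the n processes hold the message at the end."""
    rng = random.Random(seed)
    has = [False] * n
    has[0] = True                       # one initial sender
    for _ in range(rounds):
        senders = [i for i, h in enumerate(has) if h]
        for _ in senders:
            for j in rng.sample(range(n), fanout):
                has[j] = True           # duplicate receipt is harmless
    return sum(has)
```

Because the infected set grows roughly geometrically, a small constant fanout reaches nearly all of a 1,000-process group within about ten rounds; this per-process cost that does not grow with group size is the intuition behind pbcast's stable throughput.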

4. Q4 1998 saw continued advances in our collaboration with the Nuprl group, which is developing a lightweight theorem-proving capability for Ensemble's virtual synchrony protocols and security stack. This work continues to advance much more quickly than we had expected. Most recently, we developed a strategy for completing the verification of our virtual synchrony protocol stack during the first half of 1999, and for extending the work beyond that to include the Ensemble security layers and the new Spinglass protocols. Several papers are now in progress on this topic. We also completed a side-by-side comparison of optimized code produced using the automated Nuprl theorem prover with more standard Ensemble stacks. The optimization reduced software overheads by about 50% in a manner which is provably correct, a finding which is perhaps less interesting for the actual speedup than for the methodology, although the speedup is considerable.

5. Robbert van Renesse and Xiaoming Liu have completed their implementation of a suite of adaptive protocols which sense their own performance and switch to a different protocol implementation (dynamically, without disrupting communication) if a different protocol might work better in the current setting. Their work initially focuses on switching between total ordering protocols on the basis of measured network latencies, but they plan to extend this in the future to cover other forms of adaptation and other types of dynamic sensing. This work is joint with BBN/GTE labs. Xiaoming is gradually extending this work into a full-scale system based on mechanisms and support for dynamic adaptation.
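The switching policy in such an adaptive suite can be sketched as a small controller. The protocol names, threshold, and hysteresis band below are illustrative assumptions, not the actual design; the sketch only shows the measure-and-switch decision, with hysteresis so the system does not flap between protocols near the threshold.

```python
class AdaptiveOrdering:
    """Illustrative controller: chooses a total-ordering protocol from
    measured latency and switches only when the measurement leaves a
    hysteresis band around the threshold, to avoid flapping."""
    def __init__(self, threshold_ms=5.0, hysteresis=0.2):
        self.threshold = threshold_ms
        self.hysteresis = hysteresis
        self.current = "sequencer"      # hypothetical default protocol
    def observe(self, latency_ms):
        hi = self.threshold * (1 + self.hysteresis)
        lo = self.threshold * (1 - self.hysteresis)
        if self.current == "sequencer" and latency_ms > hi:
            self.current = "token"      # switch; in the real system this
        elif self.current == "token" and latency_ms < lo:
            self.current = "sequencer"  # happens without message loss
        return self.current
```

The hard part of the real work, of course, is not this decision rule but performing the switch at runtime without disrupting ongoing communication.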

SOFTWARE DISTRIBUTION STATUS

All of our work is available to the public under a no-fee licensing scheme which grants unrestricted rights for others to use our technology. Our current user community includes:

NAI (Network Associates Inc.), which is using Java Groups with Ensemble outboard for research in network security and management.
Nortel, where the telecommunications research division is considering adopting Ensemble as the base technology for a network research effort focused on major telecommunications opportunities for the coming decade.
BBN, where a number of users in the AQuA effort are collaborating with us and using Ensemble.
Computer Sciences Corp., which has developed its own Ada95 binding to Ensemble for GNAT.
Technion and Hebrew Universities in Israel, which have extensive Ensemble-based research projects, and several other academic research groups which are using the system on a smaller scale.
JPL, which is investigating the use of Ensemble in a new generation of software technologies for distributed space applications.
Lockheed Martin, which is exploring the use of SC-21 technologies in the Arsenal Ship and AEGIS systems. In France, Dassault-Aviation is also looking at Ensemble in this context.

Status - Q2 1998

Scalable Cluster Management and Communication - Werner Vogels

In collaboration with the Scalable Server Lab of Microsoft Research, the Reliable Distributed Computing group continues to explore the application of group communication and management in commercially oriented distributed systems. We have performed a detailed analysis of the Microsoft Cluster Service, with the goal of investigating the scalability of the distributed algorithms used for cluster membership and intra-cluster communication. Preliminary results of the global analysis appear in a paper in the 1998 Fault-Tolerant Computing Symposium, and a detailed scalability analysis appears in a paper in the 1998 Usenix Windows NT Symposium.

The results of the analysis point to significant problems in the management and communication algorithms. We are developing and implementing new membership algorithms targeted towards scalable node membership for clusters of 16 nodes and more. Variations that implement hierarchical clusters (clusters of clusters) are also under investigation. An important focus of this research is the engineering issues involved in developing membership as an operating system service: how should we structure the service to ensure that nodes under heavy load remain responsive with respect to management of the global cluster membership? Which parts should reside inside the OS kernel, which parts should be treated as real-time processes, and which can run as regular activities? All of this must be done without unnecessarily taking resources away from the applications running on the cluster.
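The core bookkeeping of a node-membership service can be sketched in a few lines. This is a toy heartbeat monitor, not the MSCS or Ensemble algorithm (which must additionally reach agreement among the surviving nodes): a node is considered a member until `timeout` seconds pass without a heartbeat from it.

```python
class Membership:
    """Toy heartbeat-based membership monitor (an illustrative sketch):
    a node is dropped from the view after `timeout` seconds without a
    heartbeat. Times are passed in explicitly to keep it testable."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}             # node -> time of last heartbeat
    def heartbeat(self, node, now):
        self.last_seen[node] = now
    def members(self, now):
        """Current view: every node heard from within the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t <= self.timeout)
```

The engineering questions raised above show up even in this sketch: if heartbeat processing runs as a regular user activity on a heavily loaded node, the node may be declared failed simply because its heartbeats were delayed, which is one argument for treating this path as a kernel or real-time service.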

Papers

Vogels, W., Dumitriu, D., Birman, K., Gamache, R., Short, R., Vert, J., Massa, M., Barrera, J., and Gray, J., "The Design and Architecture of the Microsoft Cluster Service -- A Practical Approach to High-Availability and Scalability", Proceedings of the 28th Symposium on Fault-Tolerant Computing, Munich, Germany, June 1998.

Vogels, W., Dumitriu, D., Agrawal, A., Chia, T., and Guo K., "Scalability of the Microsoft Cluster Service", Proceedings of the Second Usenix Windows NT Symposium, Seattle, WA, August 1998.

Mobile Agents - Greg Morrisett

Many applications of mobile agents are compute-intensive. For example, an agent specialized to look for certain objects within video streams may require an enormous number of cycles. Today, most mobile agent systems are based on relatively high-level languages, such as the Java Virtual Machine (JVM). While the JVM is both portable and type-safe, using it as a platform for mobile agents requires that the host either interpret or just-in-time compile the JVM bytecodes.

Unfortunately, both interpretation and just-in-time compilation can carry tremendous performance penalties when compared to assembly code produced by an offline, highly optimizing compiler. Furthermore, each host must trust that its interpreter or JIT compiler is correct, or else a malicious agent may take advantage of the bug(s) to subvert the security of a host.

To address these issues, we have developed a type system for the 32-bit subset of the Intel IA32 (x86) architecture. Our Typed Assembly Language (TAL) is expressive enough to encode high-level source abstractions, such as private fields in objects, yet allows programmers or compilers to generate highly optimized code.

In our mobile agent system, agents can be compiled from a high-level source language to TAL. The TAL code can be shipped to a host, which can then verify that the assembly language is type-correct (i.e., respects the host's interfaces). The TAL code can then be executed immediately. Hence, the host need not pay the performance penalty of interpretation or just-in-time compilation. Furthermore, the host does not need access to a trusted interpreter or compiler.
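The check-before-execute discipline can be illustrated with a toy verifier for a miniature stack language. This is an analogy only; TAL's type system is far richer and operates on real x86 assembly. The host rejects any program it cannot prove safe, and accepted programs then run at full speed with no runtime checks.

```python
def verifies(program):
    """Toy verifier: accept a program (a list of opcodes) only if no
    instruction can underflow the operand stack. Mirrors, in miniature,
    verifying untrusted code once and then running it unchecked."""
    depth = 0
    for op in program:
        if op == "push":
            depth += 1
        elif op in ("add", "pop"):
            need = 2 if op == "add" else 1
            if depth < need:
                return False            # would underflow: reject agent
            depth -= need
            if op == "add":
                depth += 1              # "add" consumes two, pushes one
        else:
            return False                # unknown instruction: reject
    return True
```

As with TAL, the safety argument rests entirely on the (small, auditable) verifier rather than on trusting the agent's compiler.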

For more technical information about TAL, please visit the TAL web pages.

Multimedia & Server Systems Group - Thorsten von Eicken and Brian Smith

The Multimedia and Networking group at Cornell has received and installed nine 300 MHz Pentium II systems (3 dual-processor and 6 single-processor). Each of these systems runs Windows NT 4.0 and is equipped with a 21" color monitor; several have a Videum video capture card with a camera. One of the systems has an omnidirectional camera attached to the video card. All of these machines are installed in the Systems Lab, which is part of a larger facility that includes researchers in computer vision, networking, and operating systems. There is considerable cross-group work in this facility, such as the Lecture Browser project being pursued jointly with Prof. Huttenlocher's group (there are slides from a talk describing this project).

Over the past two years, as a result of the TechEd grant and prior Intel equipment donations, we have moved our systems development to Intel Architecture machines, under Windows NT, using Visual Studio as our development environment. We have ported the U-Net user-level networking architecture and our multimedia tools, as well as a number of applications that previously ran on Unix systems. All new development in the group is taking place in this environment. Completed projects since the beginning of the TechEd 2000 grant (Fall 1997) include:

An operating system microkernel written in Java (the J-Kernel) which allows multiple applications to run within a single Java virtual machine and to communicate with each other safely. The J-Kernel is, in essence, a capability operating system implemented without any hardware support.
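The capability discipline the J-Kernel enforces can be shown in miniature. The sketch below is in Python and is only an illustration of the general idea (the J-Kernel itself is written in Java): tasks never hold direct references to each other's objects, only handles that mediate every call and can be revoked.

```python
class Capability:
    """Sketch of capability-mediated sharing: the holder can invoke
    methods on the target only through this handle, and the grantor
    can revoke access at any time."""
    def __init__(self, target):
        self._target = target
    def invoke(self, method, *args):
        if self._target is None:
            raise PermissionError("capability revoked")
        return getattr(self._target, method)(*args)
    def revoke(self):
        self._target = None             # cuts off the holder immediately
```

Implementing this without hardware support, inside a single JVM, while keeping cross-task calls cheap, is the essence of the J-Kernel work.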

The J-Server bridge between Microsoft's Web server (IIS) and the J-Kernel allowing untrusted code (servlets) to be uploaded to the web server and run in its process to handle HTTP requests. This allows users of the web server to upload custom code to serve dynamic web pages without administrator involvement.

The first version of the Dalí high performance library of routines for manipulating video, audio, and image data was developed on the Intel machines and released in March. Unlike other multimedia libraries, Dalí is designed for maximal performance, not ease of use.  In the next phase of our research, we will use Dalí as the target language for code generation from a compiler of a higher-level multimedia language.

In collaboration with Dan Huttenlocher, we developed the first prototype of the lecture browser, a system for automatically recording lectures and publishing them on the Web. The lecture browser MPEG-encodes video data captured by several (currently three) cameras in a lecture hall and transfers the recorded audio and video data to a server, where a variety of processing takes place. The end result is a complete record of the lecture that can be viewed with Netscape or Internet Explorer. The automatically edited video uses multiple camera angles to give the user a greater sense of presence, and an automatically synchronized slide track compensates for the relatively low slide quality in the video. For more information contact:

Prof. Thorsten von Eicken (tve@cs.cornell.edu)
Prof. Brian Smith (bsmith@cs.cornell.edu)

Last modified on: 10/06/99