These comments are intended to identify some common
omissions made in the project proposals, and give you an idea of what you
should be thinking about in working on your design. There will be a homework
posted soon which will ask you to explain how your project will deal with these
issues, so you are advised to read these comments carefully!
What is a scalable system?
Suppose we have a system providing an online service, which consists initially of a server linked to a database running on a separate computer. The server can only cope with a finite number of concurrent requests. This limit could be dictated by a number of different factors, for instance:
· the bandwidth to the server
· the processing power available on the server
· the memory available on the server
· the bandwidth between the server and the database
At least one of these factors will be a bottleneck, dictating how much load the server can deal with. To arrive at a number like “client requests served per second”, we need to decide what the load of a typical request will be: how much data it sends to the server and gets back, how much data the server and database must transfer between them, how much processing power and memory serving the request requires, and other, more subtle factors. These might include how much benefit the server derives from caching data from the database, or from caching past computations, in the hope of reusing this data to serve future requests.
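To make the “requests served per second” reasoning concrete, here is a back-of-the-envelope sketch of a capacity model. Every parameter name and number below is an illustrative assumption, not a measurement of any real system: the point is only that each resource imposes its own ceiling, and the smallest ceiling is the bottleneck.

```python
def capacity_estimate(link_mbps=100.0, cpu_ms_per_req=5.0, mem_mb=512.0,
                      db_mbps=50.0, req_kb=50.0, db_kb=20.0,
                      mem_mb_per_req=2.0, req_duration_s=0.1):
    """Estimate sustainable requests/second as the minimum over each resource.

    All defaults are hypothetical; 1 Mbps = 125 KB/s.
    """
    limits = {
        # client link: how many request+response payloads fit in the pipe
        "client bandwidth": link_mbps * 125.0 / req_kb,
        # processing: one fully busy CPU handling cpu_ms_per_req per request
        "cpu": 1000.0 / cpu_ms_per_req,
        # server<->database link, given db_kb transferred per request
        "db bandwidth": db_mbps * 125.0 / db_kb,
        # memory bounds *concurrent* requests; divide by request duration
        # to turn that into a throughput ceiling
        "memory": (mem_mb / mem_mb_per_req) / req_duration_s,
    }
    bottleneck = min(limits, key=limits.get)
    return bottleneck, limits[bottleneck]

print(capacity_estimate())  # with these assumed numbers, the CPU binds first
```

Changing any one assumption (say, a heavier database access per request) can move the bottleneck to a different resource, which is exactly why the load model must come before any scalability claim.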
So what do we do if we are going to exceed the load the server can deal with? We can optimise the factor causing the bottleneck, raising the load the server can handle, or we can add extra servers to distribute the load. This, however, introduces problems of its own.
We can’t just throw more servers at the system blindly: we have to do something to coordinate their activities. For instance, suppose that servers cache database state to avoid having to go to the database at every request. Now, with two or more servers, if one updates the database, it may invalidate data which another has cached. The general principle is: if we replicate servers and they share any information, it must be kept consistent! And maintaining consistency doesn’t come for free: it involves communication between the servers, which will diminish the benefits which we get by replication. Generally, the more state that must be kept consistent, the more communication required to keep it consistent, and the less extra capacity we get from replication.
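The cache-invalidation scenario above can be sketched in a few lines. This is a toy, single-process model with made-up class and method names, not a real caching framework: its only purpose is to show that a write on one replica forces a consistency message to every peer, which is the communication cost the paragraph describes.

```python
class Server:
    """A replica that caches database reads and invalidates peers on writes."""

    def __init__(self, db):
        self.db = db        # shared "database" (here just a dict)
        self.cache = {}     # this server's private cache
        self.peers = []     # the other replicas

    def read(self, key):
        if key not in self.cache:          # cache miss: go to the database
            self.cache[key] = self.db[key]
        return self.cache[key]

    def write(self, key, value):
        self.db[key] = value
        self.cache[key] = value
        for peer in self.peers:            # consistency traffic: one message per peer
            peer.cache.pop(key, None)      # drop any stale cached copy

db = {"x": 1}
a, b = Server(db), Server(db)
a.peers, b.peers = [b], [a]

b.read("x")         # b caches x = 1
a.write("x", 2)     # a updates the database and invalidates b's copy
print(b.read("x"))  # b re-reads from the database and sees 2, not a stale 1
```

Note that without the invalidation loop, `b.read("x")` would keep returning the stale value 1 forever; with it, every write costs one message per peer, which grows with the number of replicas.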
Replication does offer an additional benefit: added redundancy, and therefore resilience to failures. But a failure among replicated servers has two consequences: first, the overall capacity of the system goes down, and second, when the failed server comes back up, it must bring itself up to date with the other servers. If a lot of state is maintained among the servers, this can be a slow and nontrivial procedure!
From reading the previous paragraphs, it should become clear that designing for scalability and reliability is a non-trivial exercise: there are many factors to consider and tradeoffs to be made. Statements about how scalable or reliable your system is can’t be made without numbers: you need to come up with a model of a typical client interaction (how long it is, what database accesses it involves, how much time it takes, and so on), and then you can arrive at an estimate of how many concurrent interactions you can support. Similarly, to estimate the benefit of replicating the server, you need to work out how much state needs to be shared and what the cost of keeping it consistent is. Finally, you need to consider which parts of the system might fail, what the consequences might be, and how (or whether) you might guard against them. Having identified all these factors, you can quantify scalability and reliability by simulation or analysis.
A fair number of people wrote that they would rely on WebLogic (or E-Speak) to provide the reliability and scalability guarantees for their projects. We would prefer that you adopt a more critical and experimental approach. One of the things we hope you will learn from taking CS514 is that reliability and scalability are quantities which are often underspecified or not taken seriously in commercial software products, and that there are technologies such as group communication which can provide stronger and more precise guarantees, and often offer higher performance than the popular solutions. Rather than just relying on what WebLogic gives you, you might consider how you can increase the reliability of the system that WebLogic allows you to build, and experimentally compare this against the “vanilla” scheme. Whether by rewriting components or using more reliable “wrappers”, what sort of guarantees can you add? Does your resulting scheme perform better or worse? Is it more resistant to crashes? Please bear in mind that a well-thought-out investigation along these lines will not be penalised for coming up with a negative result!
If you have questions about these issues, Ben Atkin would be happy to answer them, either through e-mail at batkin@cs.cornell.edu, or if you drop by during his office hours (Tuesday 11-12; Wednesday and Friday 2:30-3:00, in Upson 5138).