track two possibilities

Dienst: distributed searching improvements - "track two"

(DRAFT: do not copy or redistribute)

Here are some loosely organized but informed thoughts on which improvements to focus on once the "track one" Dienst distributed searching changes have been implemented.

Dynamic timeout values for both phase one and phase two:

Given Jim French's work on the mathematical distribution of our response time data, it should be possible to predict timeout values more accurately from a small number of data points, and so to adjust timeout values appropriately as conditions change.

Timeout values and reliability metrics could also be tuned to each remote indexer, rather than using blanket values and algorithms for all remote indexers.
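
As a rough illustration of per-indexer dynamic timeouts, here is a minimal sketch in Python. It does not use Jim French's distribution model; it substitutes the smoothed mean-plus-deviation estimator familiar from TCP retransmission timers. The class name, the gain values, the headroom factor and the initial timeout are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimeoutEstimator:
    """Timeout estimate for one remote indexer (hypothetical helper)."""

    alpha: float = 0.125           # gain for the smoothed mean (assumed)
    beta: float = 0.25             # gain for the smoothed deviation (assumed)
    headroom: float = 4.0          # deviations allowed above the mean (assumed)
    initial_timeout: float = 30.0  # seconds, used before any data arrives (assumed)
    mean: Optional[float] = None
    deviation: float = 0.0

    def observe(self, response_time: float) -> None:
        """Fold one measured response time (in seconds) into the estimate."""
        if self.mean is None:
            # First data point: seed the mean and a generous deviation.
            self.mean = response_time
            self.deviation = response_time / 2.0
            return
        error = abs(self.mean - response_time)
        self.deviation = (1 - self.beta) * self.deviation + self.beta * error
        self.mean = (1 - self.alpha) * self.mean + self.alpha * response_time

    def timeout(self) -> float:
        """Current timeout: smoothed mean plus headroom for variability."""
        if self.mean is None:
            return self.initial_timeout
        return self.mean + self.headroom * self.deviation


# One estimator per remote indexer instead of a single blanket value.
estimators = defaultdict(TimeoutEstimator)
```

Each indexer's timeout then tightens or loosens on its own as its observed response times change.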

Further improvements to the ISDB and to the reliability metric:

At a minimum, it might be useful to treat non-responding indexers separately from responding indexers in the algorithm. Given results from the simulator, we should be able to find a way to do this that maximizes performance for the users. We had originally intended to keep a timed low-pass filter for failure rate in addition to the timed low-pass filter for response time, but finding a reasonable reliability metric that combined the two filters turned out to require a great deal of computation to determine the various constants and variables. The simulator will provide a way to perform these computations.

It is likely that an ideal system would have no constants in the reliability metric or ISDB formula: all would be variables. We might even want a reliability metric that changes as conditions change.
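
To make the idea of two timed low-pass filters concrete, here is a minimal Python sketch. It assumes that "timed" means the weight given to a new sample grows with the time elapsed since the previous sample. The linear combination in reliability_score is only a placeholder of ours, with made-up weights; the real combination is what the simulator would have to determine.

```python
import math
from typing import Optional


class TimedLowPassFilter:
    """Exponential low-pass filter whose smoothing depends on elapsed time."""

    def __init__(self, time_constant: float) -> None:
        self.time_constant = time_constant  # seconds (assumed parameter)
        self.value: Optional[float] = None
        self.last_time: Optional[float] = None

    def update(self, sample: float, now: float) -> float:
        """Fold in a sample taken at time `now` (seconds) and return the new value."""
        if self.value is None:
            self.value = sample
        else:
            elapsed = max(0.0, now - self.last_time)
            # The longer since the last sample, the more the new one counts.
            weight = 1.0 - math.exp(-elapsed / self.time_constant)
            self.value += weight * (sample - self.value)
        self.last_time = now
        return self.value


def reliability_score(response_time: float, failure_rate: float,
                      time_weight: float = 1.0,
                      failure_weight: float = 10.0) -> float:
    """Placeholder combination of the two filters; lower is better.

    Both weights are hypothetical and would be variables to tune,
    not constants.
    """
    return time_weight * response_time + failure_weight * failure_rate
```

One filter would be fed response times and the other a 0 or 1 for each success or failure, so that each indexer carries a smoothed response time and a smoothed failure rate.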

Better fault tolerance:

If we distinguish between indexers that fail to respond at all and indexers that respond with errors, and further between failures to respond caused by connectivity problems and those caused by server problems, then we can have better fault tolerance. Again, this informs our choices of indexers.
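
The distinctions above might be recorded along the following lines. This Python sketch is only illustrative: the category names are ours, it assumes requests go over HTTP so that a status code is available, and the rule that a refused connection means a server problem while any other failure to respond means a connectivity problem is a crude assumption that would need checking against real failures.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    OK = "ok"
    ERROR_RESPONSE = "responded with an error"
    SERVER_PROBLEM = "no response: server problem"
    CONNECTIVITY_PROBLEM = "no response: connectivity problem"


def classify(exc: Optional[BaseException], status: Optional[int]) -> Outcome:
    """Classify one request to a remote indexer (hypothetical helper).

    `exc` is the exception raised while contacting the indexer, if any;
    `status` is the HTTP status code, if a response arrived.
    """
    if exc is None:
        if status is not None and status < 400:
            return Outcome.OK
        return Outcome.ERROR_RESPONSE
    if isinstance(exc, ConnectionRefusedError):
        # The host answered but nothing is listening: assume the server is down.
        return Outcome.SERVER_PROBLEM
    # Timeouts, unreachable hosts and so on: assume a connectivity problem.
    return Outcome.CONNECTIVITY_PROBLEM
```

Keeping a separate count per outcome for each indexer would let the reliability metric weigh the categories differently.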

We could potentially also take server load and network load into account.

There is a whole avenue to pursue here involving the sharing of performance metadata among remote indexers to improve fault tolerance.

Dynamic time intervals for the reliability retry interval.

Poll indexers for response time data during lulls, to better inform indexer choices.

Choose indexers based on response time as well as, or in place of, the ordered mapping of authorities to indexers maintained at the Master Meta Server.

If we choose indexers based on response time, then we probably want to add a random factor into the computation of expected response time. This is to avoid having all searches go to one or two fast indexers, thereby slowing them down. [thanks to Robbert van Renesse for this notion]
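
A minimal Python sketch of the random factor follows. The function name, the uniform multiplicative jitter and the 25% default are assumptions for illustration; any other perturbation of the expected response time would serve the same purpose.

```python
import random
from typing import Dict, Optional


def pick_indexer(expected_times: Dict[str, float],
                 jitter: float = 0.25,
                 rng: Optional[random.Random] = None) -> str:
    """Pick the indexer with the lowest randomly perturbed expected time.

    `expected_times` maps indexer name to expected response time in seconds.
    Each time is scaled by a random factor in [1 - jitter, 1 + jitter], so
    indexers with similar expected times share the load instead of the
    single fastest one receiving every search.
    """
    rng = rng or random.Random()
    perturbed = {
        name: t * rng.uniform(1.0 - jitter, 1.0 + jitter)
        for name, t in expected_times.items()
    }
    return min(perturbed, key=perturbed.get)
```

With a jitter of zero this reduces to always choosing the fastest indexer; a much slower indexer is still never chosen as long as its perturbed time cannot drop below the others.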

Adjust the set of indexers polled based on collection data, as well as performance data.

Jim French expressed an interest here; Dave Fielding is also interested, I believe.

Possibly return search results as received, rather than after the search has ended.
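
Returning results as they are received could look something like this Python sketch, which hands each indexer's results to the caller as soon as that indexer answers. The search functions here are hypothetical stand-ins for real requests to remote indexers, and timeouts and error handling are left out.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterator, List, Tuple


def search_streaming(
    query: str,
    indexers: Dict[str, Callable[[str], List[str]]],
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (indexer name, results) pairs in the order the indexers finish."""
    with ThreadPoolExecutor(max_workers=max(1, len(indexers))) as pool:
        futures = {pool.submit(fn, query): name for name, fn in indexers.items()}
        for future in as_completed(futures):
            # result() re-raises any exception the search function raised.
            yield futures[future], future.result()


# Hypothetical stand-ins for remote indexers.
def fast_indexer(query: str) -> List[str]:
    return [f"fast hit for {query}"]


def slow_indexer(query: str) -> List[str]:
    time.sleep(0.3)  # simulate a slow remote indexer
    return [f"slow hit for {query}"]
```

A caller iterating over search_streaming("q", {"slow": slow_indexer, "fast": fast_indexer}) would see the fast indexer's results first and could display them while the slow one is still working.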