A Simulator for Dienst Distributed Searching Algorithms

(DRAFT: do not copy or redistribute)

Introduction:

While designing the "track one" distributed searching improvements to dienst, we were unable to use certain algorithms for the reliability metric because they depended on variables that would have to be computed from our data, and there was no easy way to do those computations. For example, if we split non-responding indexer data out from response time data, we need a way to usefully recombine the two in the reliability metric. It is easy to come up with candidate equations, but difficult to evaluate which equation would give the best results in the context of dienst. It is also difficult to determine what values to assign to the variables in those equations -- without running some data through a simulator, our choices would be arbitrary, or would require much fiddling in a production environment.
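
To make this concrete, here is one sketch of the kind of candidate metric we would want to compare. The function, its weighting constants, and the timeout used for normalization are illustrative assumptions, not part of the current dienst design; the whole point of the simulator is to find out whether something like this works at all and what the constants should be.

  # Hypothetical reliability metric: combine an indexer's failure rate
  # and mean response time into one score in [0, 1], higher is better.
  # The constants are placeholders the simulator would help us tune.

  def reliability(successes, failures, mean_response_secs,
                  timeout_secs=60.0, w_fail=0.6, w_time=0.4):
      attempts = successes + failures
      if attempts == 0:
          return 0.5                    # no history yet: assume "average"
      success_rate = successes / float(attempts)
      # Scale mean response time against the timeout so it lands in [0, 1].
      speed = 1.0 - min(mean_response_secs / timeout_secs, 1.0)
      return w_fail * success_rate + w_time * speed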

Because of these difficulties, the "track one" distributed searching improvements fudged some details. We would like to crunch data from dienst sites around the world: try out different values for constants, try replacing constants with variables, try different algorithms, and see how search performance and network bandwidth are affected. If nothing else, it would be useful to know whether constants, variables, and algorithms should be consistent worldwide or whether we need to be able to adjust them for different conditions.

 

Goals:

  1. to test how different reliability metrics and different values given to the variables and constants in a reliability metric affect the performance of distributed searching in dienst
  2. to test different scenarios of sharing performance metadata among dienst servers and acting on the shared data in a way that affects distributed searching
  3. possibly, to test different timeout values

Simulator results can inform not only which reliability metric is ideal for dienst and what performance metadata needs to be tracked and stored, but also how to manipulate constants and variables in the formulas used for the reliability metric and for combining data in the indexer state database.
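
As an example of the second kind of formula -- folding new performance observations (our own, or shared by another server) into the indexer state database -- one candidate is an exponentially weighted moving average. The record fields and the decay constant below are illustrative assumptions; choosing a sensible decay constant is exactly the sort of question the simulator should answer.

  # Hypothetical ISDB update rule: fold one observation of an indexer's
  # behavior into its stored state with an exponentially weighted
  # moving average.  ALPHA is a placeholder value to be tuned.

  ALPHA = 0.3   # higher values weight recent observations more heavily

  def update_isdb_entry(entry, response_secs, responded):
      # entry is a dict such as:
      #   {'mean_response': 2.0, 'successes': 10, 'failures': 1}
      if responded:
          entry['successes'] += 1
          entry['mean_response'] = (ALPHA * response_secs
                                    + (1.0 - ALPHA) * entry['mean_response'])
      else:
          entry['failures'] += 1
      return entry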

 

Input Data:

Input data for the simulator needs to be derived from dienst logs. Such logs are readily accessible (to a Cornell Computer Science department member) at cs-tr.cs.cornell.edu and www.ncstrl.org, but we hope to get log data from dienst sites around the world. Specifically, we want information such as the following for each search:

  1. time initiated
  2. what indexers were used for which authorities in phase one
  3. what indexers were used for which authorities in phase two
  4. how phase two ended (once the "track one" distributed searching improvements are applied)

It would probably be efficient to run the log data through a pre-processing batch program that creates the data files used as input to the simulator.
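
As a rough sketch of that pre-processing step, assuming a hypothetical tab-separated log format (search id, time initiated, phase, authority, indexer, outcome), the batch program might look something like the following; the real dienst log format would of course dictate the actual parsing.

  # Hypothetical pre-processor: read dienst search-log lines and write
  # one simulator input record per search.  The tab-separated layout
  # assumed here is illustrative only.

  import csv
  import sys
  from collections import defaultdict

  def preprocess(log_path, out_path):
      searches = defaultdict(lambda: {'time': None, 'phase1': [],
                                      'phase2': [], 'phase2_outcome': None})
      with open(log_path) as log:
          for line in log:
              sid, when, phase, authority, indexer, outcome = \
                  line.rstrip('\n').split('\t')
              rec = searches[sid]
              rec['time'] = rec['time'] or when
              if phase == '1':
                  rec['phase1'].append((authority, indexer))
              else:
                  rec['phase2'].append((authority, indexer))
                  rec['phase2_outcome'] = outcome
      with open(out_path, 'w', newline='') as out:
          writer = csv.writer(out)
          for sid, rec in searches.items():
              writer.writerow([sid, rec['time'], rec['phase1'],
                               rec['phase2'], rec['phase2_outcome']])

  if __name__ == '__main__':
      preprocess(sys.argv[1], sys.argv[2])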

 

The Simulator:

Given the input data files, the simulator would model a system of interoperating dienst servers. Indexer state database (ISDB) files would be created and updated by the simulator based on the algorithm being tested and the input data, and search results would be simulated. The output would be data files indicating what the search results would have been (time to user, which indexers were used) given the algorithms and variable values being tested for the ISDB and the reliability metric.
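
The core of the simulator might be little more than a replay loop along these lines; every structure and helper name here (the shape of the search records, the metric and update functions) is an assumption for illustration, not actual dienst code.

  # Rough shape of the simulator's main loop: replay each logged search,
  # choose indexers with the reliability metric under test, record the
  # simulated outcome, and update the in-memory ISDB as we go.

  def simulate(searches, indexer_stats, metric, update_fn, outfile):
      # searches      -- records produced by the pre-processing step
      # indexer_stats -- dict mapping indexer name to its ISDB entry
      # metric        -- function scoring an ISDB entry (higher is better)
      # update_fn     -- function folding an observation back into an entry
      with open(outfile, 'w') as out:
          for search in searches:
              total_time = 0.0
              used = []
              for authority, candidates in search['authorities'].items():
                  # Pick the candidate the metric currently ranks highest.
                  best = max(candidates,
                             key=lambda ix: metric(indexer_stats[ix]))
                  response_secs, responded = search['observed'][best]
                  total_time = max(total_time, response_secs)
                  used.append((authority, best))
                  update_fn(indexer_stats[best], response_secs, responded)
              out.write('%s\t%.2f\t%s\n' % (search['id'], total_time, used))

In practice the metric argument would just be an adapter around something like the reliability sketch above, e.g. lambda entry: reliability(entry['successes'], entry['failures'], entry['mean_response']).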

 

The Results:

The results would need to be made sense of: tabulated, summarized, evaluated statistically. I see the results being published, as well as informing our decisions about ISDB design, reliability metric design, the values of constants and variables, and so on in the production system.
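
A first pass at that tabulation could be as simple as comparing mean time-to-user across configurations; the sketch below assumes the tab-separated output format used in the simulator sketch above.

  # Hypothetical summary pass over simulator output: report the number
  # of searches and the mean simulated time-to-user per results file.

  import sys

  def summarize(result_files):
      for path in result_files:
          times = []
          with open(path) as f:
              for line in f:
                  _sid, total_time, _used = line.rstrip('\n').split('\t')
                  times.append(float(total_time))
          mean = sum(times) / len(times) if times else 0.0
          print('%s: %d searches, mean time to user %.2f s'
                % (path, len(times), mean))

  if __name__ == '__main__':
      summarize(sys.argv[1:])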

And given the desire to share ISDB data among interoperating dienst servers in order to better predict dienst indexer behavior, I see the simulator as key in informing those design decisions as well.