
Dienst: distributed searching improvements - "track one"

(DRAFT: do not copy or redistribute)

Goal: new release with distributed searching improvements ready by September 30, 1997

Problems we're trying to solve:

searches take too long

poor predictions of whether an indexer will respond before the timeout value

continued research in distributed searching

Proposed changes

Improve timeout values for phase one and phase two

Rework method for reliability retry

Improve reliability metric

Overlap phase one and phase two searching

Indicate phase two searches in dienst logs

Tabulate phase statistics in log summaries and in Dienst/htdocs/dienst_runtime/logs.html

Improve timeout values for phase one and phase two

Determine better timeout values for each phase by

Examining search response data in logs

Phase statistics

cs-tr.cs.cornell.edu from 1-Jan-96 to 30-Jun-96

cs-tr.cs.cornell.edu from 1-Jul-96 to 31-Dec-96

cs-tr.cs.cornell.edu from 1-Jan-97 to 30-Jun-97

www.ncstrl.org from 31-Jul-96 to 28-Aug-97

Changing value on ncstrl and/or cs-tr and examining logs after change

New value will still be a hardcoded constant in config_constants.pl
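One possible way to examine the search response data is to ask what timeout would have covered a given share of past responses. The sketch below assumes the response times (in seconds) have already been extracted from the logs; the function name, the nearest-rank method, and the sample numbers are illustrative assumptions, not part of Dienst.

```python
import math

def timeout_for_coverage(response_times, percent):
    """Return the smallest observed response time that covers at least
    `percent` percent of the responses (nearest-rank percentile)."""
    if not response_times:
        raise ValueError("no response times to examine")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(response_times)
    rank = math.ceil(percent * len(ordered) / 100)
    return ordered[rank - 1]

# Made-up response times in seconds, for illustration only.
sample = [0.8, 1.1, 1.3, 2.0, 2.4, 3.5, 4.0, 6.2, 9.0, 30.0]
candidate = timeout_for_coverage(sample, 90)
```

A value chosen this way would still be entered by hand as the constant in config_constants.pl, and checked against the logs after the change.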

Rework method for reliability retry

If an indexer is deemed "unreliable" and is demoted, we currently put it back in service, which means we retest the reliability of the indexer at the expense of our users. The new method will fork a process that calls the remote indexer with a dienst Version request. If the Version request doesn't respond before the timeout, then (a) reset the retry interval for this indexer in the indexer state database (ISDB), and (b) if we are able to learn the e-mail address of maintainer@remote.indexer and it's not a network problem (ping works), we might send a message reporting the failure of the dienst server at the remote site. (Dave's notion is to have these e-mail addresses at MMS and at RMS; we would also need to be careful to avoid flooding mailboxes with messages.) If the Version request does respond before the timeout, then reinitialize the ISDB for this indexer (perhaps after doing further response time testing with a search request at the remote indexer).
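The decision logic described above might be sketched as follows. This is only an illustration: the ISDB is modeled as a plain dictionary, the field names (`demoted`, `next_retry`) and return strings are hypothetical, and the Version request and ping are passed in as callables so the sketch does not assume how Dienst actually issues them. Forking and the e-mail step itself are left out.

```python
import time

def retry_demoted_indexer(indexer, isdb, version_probe, ping,
                          now=None, retry_interval=3600):
    """Probe a demoted indexer and update its (hypothetical) ISDB record.

    version_probe(indexer) -> True if the Version request answered in time.
    ping(indexer)          -> True if the host is reachable on the network.
    Returns "reinstated", "retry_later", or "retry_later_notify".
    """
    now = time.time() if now is None else now
    if version_probe(indexer):
        # Answered before the timeout: reinitialize the record.
        isdb[indexer] = {"demoted": False, "next_retry": None}
        return "reinstated"
    # No answer in time: stay demoted and reset the retry interval.
    record = isdb.setdefault(indexer, {})
    record["demoted"] = True
    record["next_retry"] = now + retry_interval
    # Only consider notifying the maintainer when the network itself is fine.
    if ping(indexer):
        return "retry_later_notify"
    return "retry_later"

# Example with stand-in probes: the server is down but the host answers ping.
isdb = {}
outcome = retry_demoted_indexer("remote.indexer", isdb,
                                version_probe=lambda name: False,
                                ping=lambda name: True,
                                now=100, retry_interval=50)
```

Because the probe runs outside any user search, a slow or dead indexer costs users nothing while it is being retested.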

Improve reliability metric

This will involve a redesign of the data kept in the indexer state database (ISDB) as well as a reworked reliability metric. The goal is to greatly improve our prediction of "will this indexer respond before the timeout value?" and use the predictions to reduce the number of searches that enter phase two.

More details about the new ISDB and reliability metric

Overlap phase one and phase two searching

Continue listening for phase one indexers after phase two has started. We will need to maintain a list of indexers called in phase one and remove an indexer from the list whenever we receive valid results from said indexer. Once we hit the phase one timeout, we keep track of the phase one indexers (authorities?) that haven’t yet responded and we also keep track of the phase two indexers (authorities?) we’re calling. At this point we take results from either phase one or phase two, and finish taking results when we get to the first of the following conditions:

all authorities are accounted for (phase one or phase two responses)

phase two timeout

We need to make sure we don’t deliver duplicate results for any authority. We also need to keep response time information even if we don’t use the results from a particular indexer, and we need to record in the logs how phase two ended (see next proposed change).
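The overlapped collection described above can be sketched as a loop over responses in arrival order. This is a simplified model under stated assumptions: each response is a tuple of arrival time, indexer, authority, and results; the first response for an authority wins; and the function name, tuple layout, and ending strings are hypothetical. It does not model the list of outstanding phase one indexers or real network I/O.

```python
def collect_results(events, authorities, phase_two_timeout):
    """Merge phase one and phase two responses once phase two has started.

    events: (arrival_time, indexer, authority, hits) tuples sorted by time.
    authorities: set of authorities the search must cover.
    Returns (results_by_authority, response_times, how_it_ended).
    """
    results = {}         # authority -> hits; first response wins, no duplicates
    response_times = {}  # indexer -> arrival time, kept even if hits are unused
    for arrival, indexer, authority, hits in events:
        if arrival > phase_two_timeout:
            return results, response_times, "phase two timeout"
        response_times[indexer] = arrival
        if authority in authorities and authority not in results:
            results[authority] = hits
        if len(results) == len(authorities):
            return results, response_times, "all authorities accounted for"
    # Nothing else arrived before the timeout.
    return results, response_times, "phase two timeout"

# Made-up example: a late phase one indexer and a phase two indexer both
# answer for auth1; only the first answer is delivered.
events = [
    (12.0, "idx-a", "auth1", ["doc1"]),
    (14.0, "idx-b", "auth1", ["doc1-again"]),
    (15.0, "idx-c", "auth2", ["doc2"]),
]
results, times, ended = collect_results(events, {"auth1", "auth2"}, 20.0)
```

The third return value is the piece that would feed the log entry describing how phase two ended.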

Indicate phase two searches in dienst logs

We want to make sure we are always capturing the data to indicate when we are entering phase two, which indexers were used in phase two, and how phase two ended. Possible methods for indicating this:

Have separate STATISTICS log entries for phase one and phase two results

Add a fake indexer to the STATISTICS log entry to indicate start of phase two? (indexer=phase.two)

Add message to log indicating phase two is entered, which indexers were called, and how it finished; still have one STATISTICS log entry per search

Tabulate phase statistics in log summaries and in Dienst/htdocs/dienst_runtime/logs.html

This will include both of the following:

code that reflects the new phase two indications noted above and also works on older dienst logs

code that infers phase two searching for older dienst logs