ISDB & reliability metric background and current design

The Indexer State Database (ISDB) and reliability metric:

Background and Current (old) Design

(DRAFT: do not copy or redistribute)

Background questions

What is the ISDB and why do we need it?

So that the dienst user interface can service different search requests simultaneously, we need a place to store the most current information about remote indexer performance. We keep this information in the indexer state database (ISDB), a file that is accessed by the dienst processes spawned by the dienst user interface.

What do we mean by reliability metric and how is it implemented?

The reliability metric is the test we apply to determine if we should use a remote indexer for a particular search. The test asks if the remote indexer is "reliable" -- will it respond before the timeout value? If the answer is yes, then we use this remote indexer. If the answer is no, we "demote" the indexer: we don't use it for this search (or for any others for a while) and we try to use another indexer for the authority we're searching.

What does it mean to "demote" an indexer?

An indexer is "demoted" if it fails the reliability test -- that is, if the algorithm predicts it will not respond before the timeout value. A demoted indexer isn't used in any searches for a period of time. In that case dienst stays fault tolerant: it tries to find another indexer for the desired authority.

What do we mean by reliability retry and how does it work?

Once an indexer is demoted, it isn't used for any searches for a period of time indicated by $fail_retry_time, a global constant set in Config/config_constants.pl. Once that period of time has passed, we retry the reliability of the indexer: will it now respond to a search request before the timeout value? In the context of ISDB design, we need to know how to tell when we should perform the reliability retry.

Current Design

ISDB

In the current system, the ISDB has four pieces of information for each remote indexer:

Host
Port
Number of (consecutive) times this indexer has failed to respond
Reinitialization timestamp

The host and port can be thought of as the "key" of this data; the other information is "status" information.
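As a rough illustration of the record layout described above, the sketch below models one ISDB entry and keys it by host and port. Python is used only for illustration and is not the dienst code; the class name, field names, the in-memory dictionary, and the example host and port are all hypothetical, and the on-disk format of the real ISDB file is not shown here.

```python
from dataclasses import dataclass


@dataclass
class IndexerEntry:
    """One ISDB record (hypothetical field names)."""
    host: str
    port: int
    consecutive_failures: int = 0  # "status": consecutive failures to respond
    reinit_timestamp: int = 0      # "status": 0 unless the indexer is demoted

    @property
    def key(self):
        # The host and port together act as the "key" of the record.
        return (self.host, self.port)


# Hypothetical in-memory stand-in for the ISDB file.
isdb = {}
entry = IndexerEntry("indexer.example.org", 8080)
isdb[entry.key] = entry
```

Keying on the (host, port) pair means two indexers on the same host but different ports get separate status records.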

Reliability metric

In the current release of dienst, the reliability metric is "has this indexer failed to respond before the timeout for the last five (default) consecutive searches?" We can tell this from the data in the ISDB -- if the number of consecutive failures is fewer than five, the answer is no: the indexer passes the reliability metric and is used. If the number of consecutive failures is five or greater, the indexer fails the metric and is used only if enough time has gone by to indicate a reliability retry is in order (see below).

If an indexer fails the reliability metric (and is not yet ready for a reliability retry), then we do not use this indexer -- a secondary indexer is sought for the authority in question.
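The test and the fallback to a secondary indexer can be sketched as follows. This is a minimal illustration, not the dienst implementation: the function names, the constant name, and the representation of candidates as (name, failure count) pairs are assumptions, and the reliability retry case is deliberately left out.

```python
MAX_CONSECUTIVE_FAILURES = 5  # default threshold from the text; the name is hypothetical


def passes_reliability_metric(consecutive_failures,
                              max_failures=MAX_CONSECUTIVE_FAILURES):
    """True if the indexer has fewer consecutive failures than the threshold."""
    return consecutive_failures < max_failures


def choose_indexer(candidates):
    """Pick the first candidate that passes the metric.

    candidates: list of (name, consecutive_failures) pairs for one authority,
    in order of preference. Returns None if no candidate passes.
    """
    for name, failures in candidates:
        if passes_reliability_metric(failures):
            return name
    return None


# The primary has failed five times in a row, so the secondary is chosen.
chosen = choose_indexer([("primary", 5), ("secondary", 0)])
```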

Some problems with the reliability metric in the current system:

If an indexer fails frequently, but not consecutively, the current system will always use it.
"Failure" is defined as not returning within the timeout for this search. When a phase one indexer responds after the end of phase one but before the end of phase two, we keep no statistics on the response time, and consider it a "failure."
The metric does not answer the question: can we reasonably expect the indexer to respond before the timeout value?
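The first problem can be made concrete with a small sketch. It assumes that a successful response resets the consecutive failure count to zero, which the word "consecutive" implies but the text does not state outright; the function name and the sample outcome sequence are made up for illustration.

```python
def max_consecutive_failures(outcomes):
    """Longest run of failures in a sequence of search outcomes.

    outcomes: iterable of bools, True meaning the indexer responded in time.
    """
    run = longest = 0
    for responded in outcomes:
        run = 0 if responded else run + 1
        longest = max(longest, run)
    return longest


# A hypothetical indexer that fails four searches out of every five.
flaky = ([False] * 4 + [True]) * 20
failure_rate = flaky.count(False) / len(flaky)
longest_run = max_consecutive_failures(flaky)
```

Here the indexer fails 80% of its searches, yet its longest run of failures is four, so a threshold of five consecutive failures never demotes it.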

Demotion

An indexer is demoted if a "failure" is recorded that causes the ISDB entry for this indexer to fail the reliability metric in the future. In other words, if the ISDB entry has four failures, and we just got a fifth one, then when we update the ISDB entry with this information, we demote the indexer. The demotion is indicated by setting the reinitialization timestamp to the time of the demotion -- it is zero otherwise.
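A minimal sketch of the update step follows. The dictionary keys, the function name, and the constant name are hypothetical, and resetting the count on a successful response is an assumption drawn from the word "consecutive"; only the "fifth failure sets the timestamp" behavior comes from the description above.

```python
import time

MAX_CONSECUTIVE_FAILURES = 5  # default threshold; the name is hypothetical


def record_result(entry, responded_in_time, now=None):
    """Update one ISDB entry after a search; return True if this update demoted it.

    entry: dict with 'failures' and 'reinit_ts' (0 means not demoted).
    """
    now = int(time.time()) if now is None else now
    if responded_in_time:
        entry["failures"] = 0  # assumption: a success ends the failure run
        return False
    entry["failures"] += 1
    if entry["failures"] >= MAX_CONSECUTIVE_FAILURES and entry["reinit_ts"] == 0:
        entry["reinit_ts"] = now  # demotion is marked by the timestamp
        return True
    return False


# Four failures on record, and a fifth one has just been observed.
entry = {"failures": 4, "reinit_ts": 0}
demoted = record_result(entry, responded_in_time=False, now=1000)
```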

Reliability retry

When the ISDB is read, if an indexer fails the reliability metric AND if the reinitialization timestamp for an ISDB entry is older than $fail_retry_time (a global variable set in Config/config_constants.pl), then we do a reliability retry. In the current system, a reliability retry means we reset both the number of consecutive failures and the reinitialization timestamp to zero, use this indexer in the pending search, and begin tracking data for this indexer all over again.
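The read-time decision can be sketched like this. FAIL_RETRY_TIME stands in for $fail_retry_time; expressing it in seconds is an assumption, as are the function name and the dictionary keys. "Older than" is read here as strictly greater than the retry time.

```python
FAIL_RETRY_TIME = 3600        # stands in for $fail_retry_time; seconds are assumed
MAX_CONSECUTIVE_FAILURES = 5  # default threshold; the name is hypothetical


def usable_for_search(entry, now):
    """Decide whether an indexer may be used, applying the reliability retry.

    entry: dict with 'failures' and 'reinit_ts'; it is modified on a retry.
    """
    if entry["failures"] < MAX_CONSECUTIVE_FAILURES:
        return True  # passes the reliability metric
    if now - entry["reinit_ts"] > FAIL_RETRY_TIME:
        # Reliability retry: wipe both status fields and use the indexer again.
        entry["failures"] = 0
        entry["reinit_ts"] = 0
        return True
    return False  # still demoted; look for a secondary indexer


entry = {"failures": 5, "reinit_ts": 1000}
too_soon = usable_for_search(entry, now=1000 + 3600)
after_wait = usable_for_search(entry, now=1000 + 3601)
```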

Some problems with the reliability retry in the current system:

We penalize the user – if an indexer has been unreliable, we check it by using it during user searches.
We wipe the slate clean on retry – we do not take recent unreliability into account. If an indexer is down, then the current system requires that we use it on five user searches before we take it out of service again, and we repeat this cycle at a frequency of $fail_retry_time (usually one hour).
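To get a feel for the cost of this cycle, the sketch below counts how many user searches are sent to an indexer that never responds. It is a toy simulation under stated assumptions: the retry time is one hour expressed in seconds, the threshold is five, and searches arrive at the given times; none of the names come from the dienst code.

```python
FAIL_RETRY_TIME = 3600        # assumed one hour, in seconds
MAX_CONSECUTIVE_FAILURES = 5  # default threshold; the name is hypothetical


def wasted_searches(search_times):
    """Count searches sent to an indexer that is permanently down."""
    failures = 0
    reinit_ts = 0
    wasted = 0
    for now in search_times:
        if failures >= MAX_CONSECUTIVE_FAILURES:
            if now - reinit_ts <= FAIL_RETRY_TIME:
                continue  # demoted: this search skips the indexer
            failures, reinit_ts = 0, 0  # reliability retry wipes the slate
        wasted += 1    # the search uses the dead indexer and times out
        failures += 1
        if failures >= MAX_CONSECUTIVE_FAILURES:
            reinit_ts = now  # demoted again
    return wasted


# One search per minute for four hours against a dead indexer.
count = wasted_searches(range(0, 4 * 3600, 60))
```

With these inputs, 20 of the 240 searches wait on the dead indexer: five at the start and five more after each of the three retries that fit in the four hours.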

Details on how reliability retry will be improved are given with the track one changes.