NCSTRL Documentation

Dienst protocols, Release 3.5

DRAFT

Introduction

This document describes the Dienst protocol, which provides an open, distributed digital library. The Dienst protocol is currently implemented by the Dienst system, and is the basis for the CSTR digital library.

Overview of Dienst architecture

In the Dienst architecture, there are four classes of services: A Repository Service stores digital documents, each of which has a unique name called a docid and may exist in several different formats. An Index Service server searches a collection and returns a list of docids. Each site will typically run a repository and index service for documents issued by that site. A single, centralized Meta Service (also called a Contact Service) provides a directory of locations of all other services. Finally, a User Interface service mediates human access to this library. All these services communicate via the Dienst protocol.

Note that the protocol has evolved over time and not all Dienst servers are running the most recent release.

Document Identifiers

A docid is a string that uniquely identifies a technical report. It consists of two tokens, a publisher and a string, separated by a colon. Neither token may contain whitespace or the colon character. Publishers are defined just as in RFC 1357, e.g. CORNELLCS, STAN, UCB.

The string is assigned by the publisher, and must be unique within that publisher. An example is CORNELLCS:TR92-1321. The syntax of a docid differs from that of an ID in RFC 1357 only in that the separator between publisher and string is a colon rather than a pair of slashes. The change was made because a pair of slashes would look strange inside a URL.
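
As an illustration, the docid rules above can be checked mechanically. The following Python sketch is not part of Dienst; the function name parse_docid is hypothetical, and it assumes only what is stated above: a single colon separator, and no whitespace or colon inside either token.

```python
from typing import Tuple


def parse_docid(docid: str) -> Tuple[str, str]:
    """Split a docid into (publisher, string) at the colon separator."""
    publisher, sep, local = docid.partition(":")
    if not sep or not publisher or not local:
        raise ValueError("docid must have the form publisher:string: %r" % docid)
    # Neither token may contain whitespace or a further colon.
    for token in (publisher, local):
        if ":" in token or any(ch.isspace() for ch in token):
            raise ValueError("illegal character in docid token: %r" % token)
    return publisher, local


print(parse_docid("CORNELLCS:TR92-1321"))  # ('CORNELLCS', 'TR92-1321')
```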

Dienst HTTP embedding

The Dienst protocol is (currently) embedded in HTTP, which thus imposes some restrictions on the protocol that are specific to HTTP, not to Dienst.

HTTP request methods

All Dienst requests must be expressed with either the GET or HEAD HTTP methods. In general, GET returns full information, and HEAD returns only meta information. Not all Dienst requests support HEAD.

Special characters

The syntax rules for URLs restrict a few characters to special roles and require that, if these characters are used in any other way, they be written as an escape sequence: a percent sign followed by the character code in hexadecimal. The reserved characters are:

/ - separates components in the URL.
? - separates optional arguments from the rest of the URL
# - indicates reference to a named anchor within a document
= - separates name from value in an argument list
& - separates multiple arguments after a ?

Finally, the space character may not appear anywhere in a URL. It must be written as a "+" or as a percent sign escape sequence.
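
A minimal sketch of this escaping, using Python's standard urllib.parse module rather than anything Dienst-specific. The helper name escape_value and the sample text are made up for illustration. Note that quote_plus escapes more characters than the five listed above, which is harmless because any character may be written as a percent escape.

```python
from urllib.parse import quote_plus, unquote_plus


def escape_value(text: str) -> str:
    """Write reserved characters as %XX escapes and spaces as '+'."""
    return quote_plus(text)


escaped = escape_value("logic & sets/types")
print(escaped)                # logic+%26+sets%2Ftypes
print(unquote_plus(escaped))  # logic & sets/types
```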

Optional arguments

Many of the Dienst protocol messages take optional arguments. These arguments consist of a parameter name and value, separated by an equal sign. All the arguments are joined, separated by an ampersand, and attached to the end of the URL, separated from it by a question mark. So for example, to pass the parameter timeout with value 259 to the Shred method, the URL would be Shred?timeout=259, and if a second argument weight were added, the URL would then be Shred?timeout=259&weight=7.4.
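
The joining rule can be sketched with the standard urlencode function. The helper name with_arguments is hypothetical; Shred, timeout and weight are the illustrative names used in the paragraph above.

```python
from urllib.parse import urlencode


def with_arguments(path: str, **arguments) -> str:
    """Append name=value pairs to path, joined by '&' after a '?'."""
    if not arguments:
        return path
    # urlencode also escapes reserved characters inside names and values.
    return path + "?" + urlencode(arguments)


print(with_arguments("Shred", timeout=259))              # Shred?timeout=259
print(with_arguments("Shred", timeout=259, weight=7.4))  # Shred?timeout=259&weight=7.4
```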

Standard record list header

Many Dienst messages return lists of results. Many, but not all, of these return lists of records. (Those that do not are older and retained for compatibility.) Record lists are always prefaced with a standard header consisting of two lines:
Version: version
Where version is a version number. At present, all messages use version 1.0. The version number allows for future changes to the format of the record list header or of the records themselves.
Count: N message
Where N is the number of records that follow and message is an optional error message string.
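
A client might read the two header lines as follows. This is a sketch under assumptions: the function name is hypothetical, and the exact whitespace around the colon is not specified above, so the code tolerates any amount.

```python
from typing import List, Tuple


def parse_record_list_header(lines: List[str]) -> Tuple[str, int, str]:
    """Return (version, count, message) from the first two response lines."""
    if len(lines) < 2:
        raise ValueError("record list header needs two lines")
    name, _, version = lines[0].partition(":")
    if name.strip() != "Version":
        raise ValueError("expected a Version line, got %r" % lines[0])
    name, _, rest = lines[1].partition(":")
    if name.strip() != "Count":
        raise ValueError("expected a Count line, got %r" % lines[1])
    parts = rest.strip().split(None, 1)
    if not parts:
        raise ValueError("Count line has no record count")
    # The message is optional, so it may be absent.
    message = parts[1] if len(parts) > 1 else ""
    return version.strip(), int(parts[0]), message


print(parse_record_list_header(["Version: 1.0", "Count: 3"]))  # ('1.0', 3, '')
```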

Protocol messages

For each class of Dienst service we list the messages it implements. Note that in the current implementation, conceptually distinct services (Repository, Index) are accessed through a common Web server and share the same host and port, and thus a message is seen by all of them, though only one will reply. The messages are listed by name, followed by the syntax of the URL that encodes the message.

Generic Messages

Version

/Server/Info/Version

returns the version of the service, e.g. Dienst v3-6-0. Note that older or customized servers may return a different string.

Time

/Server/Info/Time

returns the local time in RFC 1036 format. The timezone is omitted. An example is: Thu, 22 Jun 95 09:16:43
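
The example timestamp can be parsed with a strptime pattern such as the one below. This is a sketch: the pattern is inferred from the single example above, the names are hypothetical, and day and month names are matched according to the current locale (assumed here to be the default C or an English locale). Because the timezone is omitted, the result is a naive datetime.

```python
from datetime import datetime

# Pattern inferred from the example "Thu, 22 Jun 95 09:16:43".
DIENST_TIME_FORMAT = "%a, %d %b %y %H:%M:%S"


def parse_server_time(text: str) -> datetime:
    """Parse the reply to /Server/Info/Time into a naive datetime."""
    return datetime.strptime(text.strip(), DIENST_TIME_FORMAT)


stamp = parse_server_time("Thu, 22 Jun 95 09:16:43")
print(stamp.isoformat())  # 1995-06-22T09:16:43
```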

Repository Service

The repository allows a given document to be stored in many different formats, and provides messages to obtain the document or pieces of the document in any of the stored formats. In Dienst releases prior to 3.5, formats are named with MIME types; in Dienst 3.5 and later, formats are named with reserved keywords (e.g. "ocr", "postscript", "scanned").
Format names
Formats describe the intended purpose, rather than the representation, which is better described by a MIME type.
bib
Bibliographic information in RFC 1357 format.
postscript
The entire body of the document, sent as application/postscript
text
plain ASCII text, sent as text/plain
ocr
ASCII text produced by OCR, sent as text/plain
scanned
scanned page image, usually TIFF, at a resolution of at least 300 spots per inch.
inline
a page image, suitable for screen display. Usually a GIF, at about 72 dots per inch, four bits per pixel.
structure
a document structure file

In addition, there are a number of internal formats, not documented as part of the protocol.

List Contents

/Server/List-Contents

A list of the docids available from this service, one per line.

Get Document Body

/Server/TR/docid/Body[?format=format]

Return the body of the document, in the selected format.

Get Page

/Server/TR/docid/Page/NNNN[?format=format]

Return a single page, where the document is available in discrete pages, in the selected format. Reasonable values for format for Dienst 3.5 are scanned or inline.

Get Page Count

/Server/TR/docid/NPages[?format=format]

Return the number of pages for this document, when it is available in discrete pages.
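
The three document messages above share one URL shape, which the following sketch assembles. The helper name and the host dienst.example.org are hypothetical. The docid is placed in the path unescaped, as in the syntax lines above, and whether the page number must be zero-padded to match NNNN is not stated, so the caller supplies it as text.

```python
from typing import Optional
from urllib.parse import urlencode


def repository_url(base: str, docid: str, message: str,
                   fmt: Optional[str] = None) -> str:
    """Build /Server/TR/docid/message with an optional format argument."""
    url = "%s/Server/TR/%s/%s" % (base, docid, message)
    if fmt is not None:
        url += "?" + urlencode({"format": fmt})
    return url


base = "http://dienst.example.org"  # hypothetical server
docid = "CORNELLCS:TR92-1321"
print(repository_url(base, docid, "Body", "postscript"))
print(repository_url(base, docid, "Page/1", "inline"))  # page number form is a guess
print(repository_url(base, docid, "NPages"))
```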

List MIME types

/Server/TR/docid/Formats

This is an older message retained for compatibility. Its use is not encouraged. It returns a list of the available MIME content types for the document, rather than a list of the Dienst 3.5 formats. The returned list consists of lines of the form:

content-type size

where content-type is the MIME content type, and size is in bytes, if it can be determined. (In general, the size can only be determined if the data is stored in a single file.) There is no guarantee that, if the data is retrieved in this form, this is the number of bytes that will actually be transmitted, since the file might be stored compressed but transmitted uncompressed, or vice versa.
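
A reply of this shape could be read as follows. This is a sketch: the function name and sample sizes are made up, and it assumes that the size field is simply absent when the size cannot be determined, which the text above does not spell out.

```python
from typing import List, Optional, Tuple


def parse_formats(text: str) -> List[Tuple[str, Optional[int]]]:
    """Turn 'content-type size' lines into (content_type, size) pairs."""
    result = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Assumption: a missing size field means the size is unknown.
        size = int(fields[1]) if len(fields) > 1 else None
        result.append((fields[0], size))
    return result


sample = "application/postscript 120394\ntext/plain\n"  # made-up reply
print(parse_formats(sample))
```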

Index Service

The index service searches a set of descriptions of documents and returns docids for those that match. Document descriptions (bibliographic information) are stored in the RFC 1357 format.

Get Bibliographic Records

/Server/Bibliography
/Server/Bibliography?docid=docid
/Server/Bibliography?file-after=time

Returns the bibliographic information for documents on the service. The first form returns all bibliographic records, the second returns the record for a single document, and the third returns the records for all documents added or modified since time, a universal time expressed in RFC 1036 format. Note that this time is distinct from any dates encoded within the bibliographic record itself, e.g. the date the document was written.
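
The three forms can be produced by one helper, sketched below. The function name and the host are hypothetical. The timestamp pattern is inferred from the Time message example earlier in this document. Since the forms are listed separately, the sketch refuses to combine docid and file-after.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlencode

TIME_FORMAT = "%a, %d %b %y %H:%M:%S"  # inferred from "Thu, 22 Jun 95 09:16:43"


def bibliography_url(base: str, docid: Optional[str] = None,
                     since: Optional[datetime] = None) -> str:
    """Build one of the three /Server/Bibliography request forms."""
    if docid is not None and since is not None:
        raise ValueError("docid and file-after are separate request forms")
    url = base + "/Server/Bibliography"
    if docid is not None:
        return url + "?" + urlencode({"docid": docid})
    if since is not None:
        # since must be timezone-aware; it is converted to universal time.
        stamp = since.astimezone(timezone.utc).strftime(TIME_FORMAT)
        return url + "?" + urlencode({"file-after": stamp})
    return url


base = "http://dienst.example.org"  # hypothetical server
when = datetime(1995, 6, 22, 9, 16, 43, tzinfo=timezone.utc)
print(bibliography_url(base, since=when))
```

urlencode writes the colon in a docid as %3A, which is a legal percent escape under the rules for special characters.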

Search

/Server/IndexBoolean/?kwds

Searches the collection. kwds is a set of keywords and values specifying the search criteria. Returns a record list where each record begins with a blank line, then has docid, title, author, date each on a separate line.
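
A sketch of how a client might split such a reply into records. It assumes the reply starts with the two-line standard record list header, then follows the layout described above: a blank line, then docid, title, author and date. The function name and all sample data except the first docid are invented for illustration.

```python
from typing import Dict, List

FIELDS = ("docid", "title", "author", "date")


def parse_search_results(text: str) -> List[Dict[str, str]]:
    """Split a search reply into one dictionary per record."""
    body = text.splitlines()[2:]  # skip the Version and Count lines
    records = []
    i = 0
    while i < len(body):
        if body[i] == "":
            fields = body[i + 1:i + 5]
            if len(fields) == len(FIELDS):
                records.append(dict(zip(FIELDS, fields)))
            i += 5  # blank line plus four field lines
        else:
            i += 1
    return records


sample = (
    "Version: 1.0\nCount: 2\n"
    "\nCORNELLCS:TR92-1321\nA Sample Title\nA. Author\nDecember 1992\n"
    "\nSTAN:EXAMPLE-1\nAnother Title\nB. Writer\nJune 1995\n"
)
for record in parse_search_results(sample):
    print(record["docid"], "|", record["title"])
```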

allowable keywords
title
words from the title.
author
author's last or first name.
abstract
words from the abstract.
any
search for words in any of the title, author, or abstract fields, e.g. any=smith will find documents written by Smith or with Smith in the title.
publisher
symbolic name of publisher. Defaults to "any".
number
The number of the document, e.g. 259.
boolean
The connective between search terms, either "and" (the default) or "or".
Rules for bibliographic keyword matching
Words in the three bibliographic keyword fields (author, title, abstract) are matched to bibliographic entries according to the following rules:
examples
reports written by either "Davis" or "Fox"
/Server/IndexBoolean/?author=davis+or+fox
reports written by "donald" and with "robot" in the title.
/Server/IndexBoolean/?author=donald&title=robot
reports written by "donald" or with "robot" in the title.
/Server/IndexBoolean/?author=donald&title=robot&boolean=or
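
The example URLs above can be generated programmatically. In this sketch the helper name search_url is hypothetical; the keyword set is copied from the list of allowable keywords, and urlencode supplies the "+" for spaces.

```python
from urllib.parse import urlencode

ALLOWED_KEYWORDS = {"title", "author", "abstract", "any",
                    "publisher", "number", "boolean"}


def search_url(**criteria: str) -> str:
    """Build an IndexBoolean search URL from keyword=value criteria."""
    unknown = set(criteria) - ALLOWED_KEYWORDS
    if unknown:
        raise ValueError("unsupported keywords: %s" % sorted(unknown))
    return "/Server/IndexBoolean/?" + urlencode(criteria)


print(search_url(author="davis or fox"))
# /Server/IndexBoolean/?author=davis+or+fox
print(search_url(author="donald", title="robot", boolean="or"))
# /Server/IndexBoolean/?author=donald&title=robot&boolean=or
```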

Search (old format)

/Server/Index/(\\?.*)

This is an older form of search. It will be supported until all servers are running at least version 3.5, and then will cease. It differs from IndexBoolean in that the boolean keyword is not supported, nor are boolean operators allowed in fields.

Meta Service

Get Publishers

/MetaServer/Publishers

Returns a record list of the publishers in the collection. Each record consists of the publisher's symbolic name and "pretty name", separated by the ASCII FS character (octal 034).

Get Index Servers

/MetaServer/Indeces

Returns a record list of the Index services. Each record consists of four fields separated by the ASCII FS character (octal 034):

host
port
publishers
List of symbolic names of publishers, separated by colon
protocol
The protocol running at the server. The only supported protocol is DIENST_handler

Get Repositories

/MetaServer/Repositories

Returns a record list of the Repository services. Each record consists of four fields, separated by the ASCII FS character (octal 034):

host
port
obsolete
Do not use this field.
publishers
The symbolic names of publishers in this repository, separated by colon.
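
Records from the Index Servers and Repositories messages can be split on the FS character as sketched below. The function names, host name and port are invented for illustration; the field order follows the two lists above.

```python
from typing import Dict

FS = "\x1c"  # ASCII FS, octal 034


def parse_index_record(record: str) -> Dict[str, object]:
    """Split an Index Servers record: host, port, publishers, protocol."""
    host, port, publishers, protocol = record.split(FS)
    return {"host": host, "port": int(port),
            "publishers": publishers.split(":"), "protocol": protocol}


def parse_repository_record(record: str) -> Dict[str, object]:
    """Split a Repositories record: host, port, obsolete, publishers."""
    host, port, _obsolete, publishers = record.split(FS)  # third field unused
    return {"host": host, "port": int(port),
            "publishers": publishers.split(":")}


index_sample = FS.join(["dienst.example.org", "80", "CORNELLCS:STAN", "DIENST_handler"])
print(parse_index_record(index_sample))
```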

UI Service

There is no "UI Service" protocol. Each Dienst UI service is free to implement any user interface that the local site finds helpful. These URLs are documented simply for convenience, and may or may not be available on any given service.
/Document/docid
Return a nicely formatted HTML page summarizing information about the document. You can send this message to any UI service, and if the document is not stored on that service it will relay the message to the relevant UI service.
/TR/Search
Return an HTML form for searching the collection.
/TR/List/Authors
Return a list of all authors in the index service at the site where this UI service is running. This is not a list of all authors in the entire collection, only those at the local site. Typically this list will include hyperlinks to search for all papers by those authors.
/TR/List/Numbers
Return a list of all documents in the repository service at the site where this UI service is running. Note that "numbers" is a misnomer; a better name would be "docids", but the old name is retained for compatibility.

NCSTRL Documentation
Any comments or questions?
Contact us at help@ncstrl.org.

Acknowledgements

This work was supported in part by the Advanced Research Projects Agency under Grant No. MDA972-92-J-1029 with the Corporation for National Research Initiatives (CNRI). Its content does not necessarily reflect the position or the policy of the Government or CNRI, and no official endorsement should be inferred. This work was done at the Design Research Institute, a collaboration of Xerox Corporation and Cornell University, and at the Computer Science Department at Cornell University.
