Gerard Salton
PhD Harvard University, 1958

For information about Professor Salton's research, please contact Chris Buckley.

Natural-language text processing is a rapidly expanding field of research and development. Large masses of machine-readable text now exist that can be cheaply stored on high-density optical storage media and rapidly retrieved on demand. Furthermore, sophisticated methods are available for analyzing document texts, formulating appropriate user queries, conducting rapid file searches, and ranking the retrieved items in decreasing order of importance to the users.

At Cornell, we design and operate large, general-purpose text processing environments where texts can be handled without restrictions on size or subject matter. In the absence of knowledge bases that would be useful for unrestricted text databases, we use corpus-based text analysis systems that determine the meaning of words and expressions through a refined context analysis based on statistical and probabilistic criteria. With these corpus-based approaches, we are able to determine text similarity with a high degree of accuracy. There are two main applications:

We have done extensive work with an automated encyclopedia containing about 25,000 encyclopedia articles (the Funk and Wagnalls New Encyclopedia). In addition, we are processing the TREC collection of about 800,000 full-text documents covering a number of different subject areas (over 2 gigabytes of text).

A sophisticated search and retrieval service exists, as well as a text linking system capable of relating different text sections, paragraphs, and sentences. The main test vehicle continues to be the current version of the SMART text analysis and retrieval system, operating under UNIX on Sun SPARCstations and Sun-4 terminal equipment.
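
To make the idea of statistically determined text similarity concrete, here is a minimal sketch in Python. It assumes a simple term-frequency and inverse-document-frequency weighting with cosine similarity, which is one common statistical criterion; it is not the SMART system's actual code, and the function names (tokenize, tfidf_vectors, cosine) and the smoothing formula are illustrative choices, not taken from this page.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse term-weight vector (a dict) per document.

    Weight = term frequency * smoothed inverse document frequency.
    The smoothing (log((1 + n) / (1 + df)) + 1) is an assumption made
    for this sketch so that no term receives a zero weight.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Under these assumptions, calling tfidf_vectors on a list of text sections and then cosine on any two of the resulting vectors yields a score between 0 and 1. Ranking retrieved items by that score against a query vector, or linking pairs of paragraphs whose score exceeds a chosen threshold, corresponds roughly to the retrieval and text linking uses described above.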

