1927-1995
In Memory of:
Gerard Salton
Professor
PhD Harvard University, 1958
For information about Professor Salton's research, please contact
Chris Buckley.
Natural-language text processing is a rapidly expanding field of research
and development. Large masses of machine-readable text now exist that can
be cheaply stored on high-density optical storage media and rapidly
retrieved on demand. Furthermore, sophisticated methods are available for
analyzing document texts, formulating appropriate user queries, conducting
rapid file searches, and ranking the retrieved items in decreasing order of
importance to the users.
At Cornell, we design and operate large, general-purpose text processing
environments where texts can be handled without restrictions on size or
subject matter. In the absence of knowledge bases suited to unrestricted
text databases, we use corpus-based text analysis systems that determine
the meaning of words and expressions through a refined context analysis
based on statistical and probabilistic criteria. With these corpus-based
approaches, we are able to determine text similarity with a high degree of
accuracy. There are two main applications:
- The automatic generation of structured text collections (hypertext) where
semantically similar pieces of text are automatically linked. Hypertext
representations of large databases provide flexible browsing capabilities
for general-purpose text access.
- The automatic retrieval of interesting text excerpts in response to
available search queries.
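The two applications above can be illustrated with a minimal sketch. It assumes a tf-idf weighted term-vector representation compared by cosine similarity, which is one common corpus-based similarity measure; the text above does not specify the weighting or matching criteria actually used, and the function names, the idf formula, and the threshold value are hypothetical choices made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(texts):
    """Build one sparse tf-idf vector (dict: term -> weight) per text."""
    counts = [Counter(tokenize(t)) for t in texts]
    n = len(texts)
    # Document frequency: number of texts in which each term occurs.
    df = Counter(term for c in counts for term in c)
    # Smoothed idf chosen for illustration; an assumption, not a documented scheme.
    idf = {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}
    return [{term: tf * idf[term] for term, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def link_similar(texts, threshold=0.1):
    """Return (i, j, score) for every pair of texts at or above the threshold."""
    vecs = tfidf_vectors(texts)
    links = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            score = cosine(vecs[i], vecs[j])
            if score >= threshold:
                links.append((i, j, score))
    return links


def rank(query, texts):
    """Return (index, score) pairs in decreasing order of similarity to the query."""
    # The query is weighted together with the texts so that it shares their idf values.
    vecs = tfidf_vectors(texts + [query])
    query_vec = vecs[-1]
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(vecs[:-1])]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


passages = [
    "The cat sat on the mat",
    "A cat lay on a mat",
    "Stock prices fell sharply today",
]
links = link_similar(passages)
ranking = rank("cat on a mat", passages)
```

Here `links` holds the passage pairs that would be connected in a hypertext structure, and `ranking` orders the passages for a single query. Comparing every pair of passages is quadratic in the collection size, so this sketch only suits small examples.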
We have done extensive work with an automated encyclopedia of about
25,000 articles (the Funk and Wagnalls New Encyclopedia). In addition, we
are processing the TREC collection of about 800,000 full-text documents
(over 2 gigabytes of text) covering a number of different subject areas.
A sophisticated search and retrieval service exists, as well as a text
linking system capable of relating different text sections, paragraphs, and
sentences. The main test vehicle continues to be the current version of
the SMART text analysis and retrieval system, operating under UNIX on Sun
SPARCstations and Sun-4 terminal equipment.
University Activities
- Member, Engineering College Library Committee
Professional Activities
- Associate Editor, ACM Transactions on Information Systems
- Program Committee: SIGIR-95 Eighteenth International Conference on
Research and Development in Information Retrieval, Seattle, 1995;
SIGIR-96 Nineteenth International Conference on Research and
Development in Information Retrieval, Zurich, Switzerland, 1996
Awards
- Fellow, Association for Computing Machinery
Lectures
- Automatic hypertext construction for structured texts. Swiss Federal
Institute of Technology (ETH), Zurich, Switzerland, June 6, 1994.
- Automatic text structuring and decomposition. Computer Science Colloquium,
University of Maryland, College Park, MD, November 1, 1994.
- Automatic passage retrieval. TREC-3 Conference Workshop, Gaithersburg, MD,
November 2, 1994.
- Modern Information Retrieval. Computer Science Colloquium, Dartmouth
College, Hanover, NH, April 1995.
Publications
- Automatic analysis, theme generation and summarization of
machine-readable texts. Science, 264, 3 (June 1994), 1421-1426
(with J. Allan, C. Buckley, and A. Singhal).
- The effect of adding relevance information in a relevance feedback
environment. Proceedings SIGIR-94, W.B. Croft and C.J. van Rijsbergen,
editors. Springer-Verlag, London (July 1994), 292-301 (with
C. Buckley and J. Allan).
- Automatic text decomposition and structuring. Proceedings RIAO-94
Conference, CID, Paris, October 1994, 6-20 (with J. Allan).
- Automatic query expansion using SMART - TREC-3. An Overview of the
Third Text Retrieval Conference (TREC 3), D.K. Harman, editor.
National Institute of Standards and Technology. Special
Publication 500-225, Gaithersburg, MD (1995), 69-80 (with C.
Buckley, J. Allan, and A. Singhal).
Software
- The SMART text analysis and retrieval system is made available
free of charge for research purposes. Several hundred copies of
SMART (version 11) have been distributed and are used around the world.
Last modified: 16 October 1998 by Elly Cramer, Cornell University, ADM
(elly@cs.cornell.edu).