Carl Lagoze
Research Summary
Currently I am involved in the following research projects:
Representing, Storing, and Disseminating Digital Information
For the past ten years I have been examining digital object architectures in collaboration with Sandy Payette and others. Fedora is the major result of this work. Fedora is a general-purpose repository service developed jointly by The University of Virginia Library and Cornell University. The Fedora open source software gives organizations flexible tools for managing and delivering their digital content. At its core is a powerful digital object model that supports multiple views of each digital object and the relationships among digital objects. Digital objects can encapsulate locally managed content or make reference to remote content. Dynamic views are possible by associating web services with objects. Digital objects exist within a repository architecture that supports a variety of management functions. All functions of Fedora, at both the object and repository level, are exposed as web services. These functions can be protected with fine-grained access control policies. We continue to evolve the Fedora architecture and explore its deployment in a variety of contexts.
Metadata Harvesting and Reuse
During the 1990s I was an active participant in the Dublin Core Metadata Initiative. The combination of my experience in developing metadata standards and my interest in interoperability protocols led to my work with Herbert Van de Sompel and others in the formulation of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). OAI-PMH provides an application-independent interoperability framework based on metadata harvesting. It is an internationally recognized standard, used by a variety of communities, that allows providers of digital information to expose XML-based metadata that may be used to populate search engines and other web-based services. We continue to examine OAI-PMH as the basis for dissemination of more complex digital information formats.
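As a rough illustration of the harvesting model described above, the following Python sketch builds a ListRecords request URL and extracts identifiers and Dublin Core titles from a response. The verb, the metadataPrefix parameter, the oai_dc prefix, and the two XML namespaces come from the OAI-PMH and Dublin Core specifications. The base URL, the function names, and the sample response are hypothetical and exist only for this example. The sketch parses an inline sample rather than contacting a real repository, and it ignores resumption tokens and error responses.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a repository base URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return the identifier and Dublin Core titles of each harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        titles = [t.text for t in rec.iterfind(".//dc:title", NS)]
        records.append({"identifier": identifier, "titles": titles})
    return records


# Hypothetical, heavily abbreviated response used only for this example.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Sample Record</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

request_url = list_records_url("http://example.org/oai")  # hypothetical base URL
records = parse_records(SAMPLE_RESPONSE)
```

A service provider such as a search engine would issue requests like request_url against many repositories and index the parsed records.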
Digital Library Design and Deployment
My earliest work with Dienst and NCSTRL examined basic issues in designing and deploying distributed digital libraries. My current work with the National Science Digital Library (NSDL) provides a focus for investigating digital library design and deployment on a very large scale. Early work in the NSDL examined the production deployment of OAI-PMH and the ensuing problems with the quality of metadata harvested from distributed sources. Our current NSDL work is investigating the notion of an Information Network Overlay Architecture. In this model a digital library is represented in Fedora as a graph, where the nodes are digital objects representing entities in the library (agents, resources, metadata, etc.) and the arcs are typed relationships among these entities (collection membership, annotation, metadata provision, etc.). Our goal in this work is to implement a digital library infrastructure that fully represents the context and semantic relationships of digital information.
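The graph model described above can be sketched in a few lines of Python. This is only an illustration of the idea of typed nodes and typed arcs, not the Fedora or NSDL implementation. The class name, the node identifiers, and the relationship names (memberOf, describes, providedBy) are all hypothetical.

```python
from collections import defaultdict


class OverlayGraph:
    """Minimal in-memory graph: typed nodes joined by typed, directed arcs."""

    def __init__(self):
        self.nodes = {}                # node id -> entity kind
        self.arcs = defaultdict(list)  # subject id -> [(relation, object id)]

    def add_node(self, node_id, kind):
        self.nodes[node_id] = kind

    def relate(self, subject, relation, obj):
        """Add a typed arc; both endpoints must already be nodes."""
        if subject not in self.nodes or obj not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.arcs[subject].append((relation, obj))

    def related(self, subject, relation):
        """Objects reachable from subject over arcs of the given type."""
        return [o for r, o in self.arcs.get(subject, []) if r == relation]

    def subjects_of(self, relation, obj):
        """Subjects that point at obj over arcs of the given type."""
        return [
            s for s, pairs in self.arcs.items()
            if (relation, obj) in pairs
        ]


# Hypothetical entities and relationships, for illustration only.
graph = OverlayGraph()
graph.add_node("collection:1", "resource")
graph.add_node("resource:1", "resource")
graph.add_node("metadata:1", "metadata")
graph.add_node("agent:1", "agent")

graph.relate("resource:1", "memberOf", "collection:1")  # collection membership
graph.relate("metadata:1", "describes", "resource:1")   # metadata about a resource
graph.relate("metadata:1", "providedBy", "agent:1")     # metadata provision

members = graph.subjects_of("memberOf", "collection:1")
providers = graph.related("metadata:1", "providedBy")
```

Keeping the relationship type on each arc is what lets a query distinguish, for example, the agent that provided a metadata record from the resource that record describes.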
Automated Analysis and Organization of Web Information
In the spirit of Bill Arms' Automated Digital Libraries I am increasingly interested in digital libraries that are automatically constructed overlays over web-based resources. We are particularly interested in this notion in the NSDL, where our attempts to rely on human-centered metadata creation have produced both quality and scalability problems. In this context, I am working with the iVia/Infomine group at UC Riverside, who are developing tools for focused crawling, automatic classification, and automatic metadata creation. My research work in this area has two threads. First, I am working with Ph.D. student Pavel Dmitriev to examine tools for automatically grouping URLs on the web into compound documents that more logically represent the information units as perceived by humans. Second, I am working with Ph.D. student Selcuk Aya to examine tools for classifying citations in scholarly documents according to their meaning, with the goal of producing citation graphs that are a basis for quality assessment and understanding research provenance.
New Models for Scholarly Communication
The ubiquity of the web and institutional repositories provides the opportunity for a significant restructuring of scholarly communication. My work with NCSTRL, which explored distributed repositories for scholarly "grey literature", was an early investigation in this area. Our latest work, in the context of the NSF-funded Pathways project, is examining a natively digital, network-based scholarly communication system that is able to capture the digital scholarly record, make it accessible, and preserve it over time. The Pathways project will develop broadly applicable models and protocols to support a loosely coupled, highly distributed, interoperable scholarly communication system. A graph-based information model will provide a layer of abstraction over heterogeneous resources (data, content, and services). A service-oriented process model will enable the expression and invocation of multi-stage compositional, computational, and transformational information flows. The motivation for this work is described in the D-Lib article "Rethinking Scholarly Communication: Building the System that Scholars Deserve".