Mixed Content and Mixed Metadata
Information Discovery in a Messy World

Caroline R. Arms
Library of Congress
caar@loc.gov

William Y. Arms
Cornell University
wya@cs.cornell.edu


Overview

As digital libraries grow in scale, heterogeneity becomes a fact of life. Content comes in a bewildering variety of formats. It is organized and managed in innumerable different ways. Similarly, metadata comes in a broad variety. Its quality and completeness vary greatly. Digital libraries must find ways to accept materials from diverse collections, with differing metadata or with none, and provide users with coherent information discovery services.

To understand how this may be possible, it is important to recognize the overall process by which an intelligent person discovers information. If information discovery is considered synonymous with searching, then the problem of heterogeneity is probably insuperable. If, however, the complete process is considered, most of the difficulties can be tackled.

Mixed content, mixed metadata

Searching: the legacy of history

Many of the metadata systems in use today were originally developed when the underlying resources described were in physical form. If a reader has to wait hours for a book to be retrieved from library stacks, it is vital to have an accurate description, to be confident of requesting the correct item. When the first computer-based abstracting and indexing services were developed for scientific and professional information, information resources were physical items. These services were aimed primarily at researchers, with an emphasis on comprehensive searching, that is, on high recall. The typical user was a medical researcher or lawyer who would pay good money to be sure of finding everything relevant to a topic. The aim was high recall through a single, carefully formulated search.

Although there are wide differences in the details, the approaches developed for library catalogs and early information services all employed careful rules for human cataloging and indexing, heavily structured metadata, and subject access via controlled subject vocabularies or classification schemes. The underlying assumption was that users would be trained or supported by professional librarians. As recently as the early 1990s, most services retained these characteristics.

The demand for mixed content

As digital libraries have become larger, they have begun to amalgamate materials that were previously managed separately. Users now expect one-stop access to information, yet different categories of materials must still be handled differently because the mode of expression or nature of distribution demands specialized expertise or a different workflow in libraries, whether the content is digital or not. Thus the Library of Congress has separate units, such as Prints and Photographs, Manuscripts, and Geography and Maps, each managing a relatively homogeneous collection. The National Library of Medicine provides a catalog of MARC records for books and Medline as an index to journal articles. To knowledgeable users, these divisions pose few problems. However, to students, the general public, and scholars in areas not aligned with the category boundaries, the divisions can be frustrating and confusing.

Some digital libraries have been established explicitly to bring together materials from various sources and categories. For example, the National Science Foundation's National Science Digital Library (NSDL) collects information about materials of value to scientific education, irrespective of format or provenance [1]. To illustrate the variety, four NSDL collections based at Cornell University offer: data sets about volcanoes and earthquakes; digitized versions of kinematics models from the nineteenth century; sound recordings, images, and videos of birds; and mathematical theorems and proofs. Similar diversity arises even with more conventional library materials. American Memory at the Library of Congress includes millions of digital items of many different types: photographs, posters, published books, personal papers of presidents, maps, sound recordings, motion pictures, and much more [2]. Users of the NSDL or American Memory want to explore the digital collections as a whole, without needing to learn different techniques for different categories of material. Yet the conventional, flat approaches to searching and browsing are poorly adapted for mixed content.

Mixed content means mixed metadata

Given that information discovery systems must reach across many formats and genres, a natural impulse is to seek a unifying cataloging and indexing standard. The dream would be a single, all-embracing standard that suits every category of material and is adopted by every collection. However, this is an illusion. Mixed metadata appears to be as inevitable as mixed content.

There are good reasons why different metadata formats are used for different categories of resources. Maps are different from photographs, and sound recordings from journal articles. A set of photographs of a single subject may be impossible to distinguish usefully through textual metadata; the user is best served by a group of thumbnails. Digital forms, such as software, datasets, simulations, and web sites, each call for different practices. In the NSDL, many of the best-managed collections were not intended for educational use; a taxonomy of animal behavior designed for researchers is of no value to school children. Many valuable resources in NSDL have no item-level metadata. In American Memory, the records for 47,000 pieces of sheet music registered for copyright between 1870 and 1885 are brief, with an emphasis on music genre and instrumentation. In contrast, the 3,042 pieces of sheet music in another American Memory collection were selected from collections at Duke University to present a significant perspective on American history and culture; the cataloging includes detailed description of the illustrated covers and advertisements.

Reconciling the variety of formats and genres would be a forbidding task even if it were purely a matter of schemas and guidelines, but there are other forces behind mixed metadata: the social context. History is littered with metadata proposals that were technically excellent but failed to achieve widespread adoption for social and cultural reasons. One social factor is economic. Well-funded research fields, such as medicine, have the resources to abstract and index individual items (e.g., journal articles) and to maintain tools such as controlled vocabularies and subject headings, but the rich disciplines are the exceptions. Even major research libraries cannot afford to catalog every item fully. For example, the Prints and Photographs Division of the Library of Congress often creates catalog records for groups of pictures or uses very brief records for items in large collections. A second social factor is history. Catalogs and indexes represent an investment, which includes the accumulated expertise of users and librarians, and the development of computer systems. For instance, Medline, Inspec and Chemical Abstracts services index journal articles, but the services developed independently with little cross-fertilization. Unsurprisingly, each has its own conventions for description and indexing appropriate to the discipline. Any attempt to introduce a single unifying scheme would threaten upheaval and meet with resistance.

Metadata consistency

While the dream of a single metadata standard is an illusion, attempts to enhance consistency through the promotion of guidelines within communities and coordination across communities can be extremely valuable. The last decade provides many examples where benefits from metadata consistency have been recognized and steps taken to harmonize usage in specific areas.

Developments in the library community

Two structural developments for MARC records have enhanced consistency. During the period 1988 to 1995, format integration brought the variants of USMARC used for monographs, serials, music, visual materials, etc. into a single bibliographic format. In the late 1990s, the Library of Congress, National Library of Canada, and the British Library agreed to pursue MARC harmonization to reduce the costs of cataloging, by making a larger pool of catalog records available to be shared among libraries. One outcome was the MARC 21 format, which superseded USMARC and CAN/MARC. The motivation for these efforts was not explicitly to benefit users, but users have certainly benefited because systems are simpler when metadata elements are used consistently.

Other valuable modifications to the MARC standard have been made for compatibility with other metadata schemas or interoperability efforts. Some changes support mappings between MARC and FGDC, and between MARC and Dublin Core [3]. Others support citations to journal articles in convenient machine-parsable form, to use with the OpenURL standard and to allow detail to be preserved in conversions to MARC from schemas used in citation databases [4].

Recently, in response to demand from the library community, the Library of Congress has developed an XML-based metadata schema that is compatible with MARC, but simpler. The Metadata Object Description Schema (MODS) includes a subset of MARC elements and inherits MARC semantics for those elements [5]. Inherited aspects that are particularly important for American Memory include the ability to express the role of a creator (photographer, illustrator, etc.) and to specify place names in tagged hierarchical form (e.g. <country><state><city>). In some areas, MODS offers extensions and simplifications of MARC to meet known descriptive needs, including some for American Memory. MODS provides for more explicit categorization of dates, all expressible in machine-readable encodings. Coordinates can be associated with a place name. A valuable simplification is in the treatment of types and genres for resources: a short list of high-level types is allowed in one element, with all other genre terms and material designators in another. American Memory users have frequently requested better capabilities for filtering by resource type, both in specifying queries and in organizing result lists. Based on these features, and building on an internal harmonization effort, a migration to MODS is expected for American Memory collections for which MARC is not used.

The potential benefits of an XML-based schema are significant. Libraries can take advantage of general-purpose software tools available for XML. Since any element can be tagged with the language of its content and the full Unicode character set is allowed, MODS permits the assembly of multilingual records.
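
As a concrete illustration, the short Python sketch below assembles a simplified MODS-like record with the standard library's xml.etree module. The element names (titleInfo, name, role, typeOfResource, genre, subject/hierarchicalGeographic) follow the published MODS schema, but the record itself, including all of its values, is invented for illustration and is not an example of Library of Congress cataloging.

# A minimal, illustrative MODS-like record built with Python's standard
# xml.etree library. Element names follow the MODS schema, but the record is
# a simplified sketch, not actual Library of Congress cataloging.
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("mods", MODS_NS)

def q(tag):
    """Qualify a tag name with the MODS namespace."""
    return f"{{{MODS_NS}}}{tag}"

record = ET.Element(q("mods"))

# Title, tagged with the language of its content (Unicode throughout).
title_info = ET.SubElement(record, q("titleInfo"), {f"{{{XML_NS}}}lang": "en"})
ET.SubElement(title_info, q("title")).text = "View of the Capitol"

# Creator, with an explicit role (photographer, illustrator, etc.).
name = ET.SubElement(record, q("name"))
ET.SubElement(name, q("namePart")).text = "Example Photographer"
role = ET.SubElement(name, q("role"))
ET.SubElement(role, q("roleTerm"), type="text").text = "photographer"

# High-level resource type in one element, genre terms in another.
ET.SubElement(record, q("typeOfResource")).text = "still image"
ET.SubElement(record, q("genre")).text = "photographic prints"

# Place name in tagged hierarchical form.
subject = ET.SubElement(record, q("subject"))
geo = ET.SubElement(subject, q("hierarchicalGeographic"))
ET.SubElement(geo, q("country")).text = "United States"
ET.SubElement(geo, q("state")).text = "District of Columbia"
ET.SubElement(geo, q("city")).text = "Washington"

print(ET.tostring(record, encoding="unicode"))

Because such a record is ordinary XML, the same general-purpose tools can validate it, transform it, or index selected elements.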

Federated searching vs. union catalogs

Federated searching is a form of distributed searching. A client system sends a query to several servers. Each server carries out a search on the indexes that apply to its own collections and returns the results to the client, which combines them for presentation to the user. This is sometimes called metasearch or broadcast searching. Recently, systems called portal applications have emerged that incorporate a wide variety of resources into a federated search for library patrons. Most federated searching products take advantage of the Z39.50 protocol [6].
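
The broadcast pattern can be sketched in a few lines of Python. In the illustration below, the server URLs and the JSON response format are hypothetical, and a production system would speak Z39.50 rather than plain HTTP; the point is simply that each server is queried in parallel, only the first batch of results is taken from each, and the batches are merged with crude de-duplication.

# A minimal sketch of federated (broadcast) searching: send the same query to
# several servers in parallel, take the first batch of results from each, and
# merge them for the user. The server URLs and JSON response format are
# hypothetical; real portal products typically use Z39.50 rather than HTTP.
import json
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SERVERS = [
    "https://catalog.example.edu/search",   # hypothetical endpoints
    "https://archive.example.org/search",
]

def search_one(server, query, limit=20):
    """Query a single server and return its first batch of records."""
    url = f"{server}?{urllib.parse.urlencode({'q': query, 'limit': limit})}"
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return json.load(response).get("records", [])
    except (OSError, ValueError):
        return []   # a slow or unreachable server contributes nothing

def federated_search(query):
    with ThreadPoolExecutor(max_workers=len(SERVERS)) as pool:
        batches = pool.map(lambda s: search_one(s, query), SERVERS)
    merged, seen = [], set()
    for record in (r for batch in batches for r in batch):
        key = (record.get("title", "").lower(), record.get("date"))
        if key not in seen:          # crude de-duplication across sources
            seen.add(key)
            merged.append(record)
    return merged

if __name__ == "__main__":
    for rec in federated_search("sheet music"):
        print(rec.get("title"))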

In recent years, standard profiles for Z39.50 index configurations have been developed by the library community in the hope of persuading vendors to build compatible servers for federated searching. These profiles represent a compromise between what users might hope for and what vendors believe can be built at a reasonable cost. The Bath Profile was originally developed in the UK and is now maintained by the National Library of Canada; a comparable U.S. National Standard is under development under the auspices of the National Information Standards Organization [7]. These profiles focus on a few fields (e.g., author, subject, title, standard number, date of publication). In the Bath Profile, the highest level for bibliographic searching also includes type of resource and language. A keyword search on any field covers all other fields.

The effectiveness of federated searching is limited by incompatibilities in the metadata or the index configurations in the remote systems. Client applications have to wait for responses from several servers and usually receive only the first batch of results from each server before presenting results to a user. Busy users are frustrated by having to wait and experienced users are frustrated by the inability to express complex queries. Duplicates are a problem when several sources return records for the same item. While federated searching is useful for small numbers of carefully managed collections, it becomes unworkable as the number and variety of collections increase.

American Memory and NSDL both use a different approach, following the pattern of union catalogs in gathering metadata records from many sources to a single location. Both digital libraries have to address inconsistencies in metadata, but have the advantage of doing so in centrally controlled systems.

Cross-domain metadata: the Dublin Core

Dublin Core represents an attempt to build a lingua franca that can be used across domains. To comprehend how rapidly our understanding is changing, it is instructive to go back to the early days of the Web. As recently as 1995, it was recognized that the methods of full-text indexing used by early Web search engines, such as Lycos, would run into difficulties as the number of Web pages increased. The contemporary wisdom was that "... indexes are most useful in small collections within a given domain. As the scope of their coverage expands, indexes succumb to problems of large retrieval sets and problems of cross-disciplinary semantic drift. Richer records, created by content experts, are necessary to improve search and retrieval" [8]. With the benefit of hindsight, we now see that the Web search engines have developed new techniques and have adapted to huge scale while cross-domain metadata schemes have made less progress.

For the first phase of the NSDL development, collection contributors were encouraged to provide Dublin Core metadata for each item and to make these records available for harvesting via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [9]. While this strategy enabled the first phase of the library to be implemented rapidly, it exposed some fundamental weaknesses of the Dublin Core approach [10]. Although each component (Dublin Core and OAI-PMH) is intended to be simple, expertise is needed to understand the specifications and to implement them consistently, which places a burden on small, lightly staffed collections. The granularity and type of the objects characterized by metadata vary greatly, and the quality of the metadata is highly variable. When contributors have invested the effort to create fuller metadata (e.g., to one of the standards that are designed for learning objects), valuable information is lost when it is mapped into Dublin Core. The overall result has been disappointing for information discovery. Search engines work by matching a query against the information in the records being searched. Many Dublin Core records contain very little information that can be used for information discovery.
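
The harvesting side of this strategy is straightforward to sketch. The Python fragment below issues the protocol's ListRecords verb with the oai_dc metadata prefix and follows resumption tokens to page through large result sets; the repository URL is hypothetical and error handling is omitted.

# A sketch of harvesting Dublin Core records with OAI-PMH. The ListRecords
# verb, the oai_dc metadata prefix, and resumption tokens come from the
# OAI-PMH specification; the repository URL here is hypothetical.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"
BASE_URL = "https://repository.example.org/oai"   # hypothetical provider

def harvest_dc(base_url):
    """Yield (identifier, title, subjects) for every record in the repository."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    while True:
        url = f"{base_url}?{urllib.parse.urlencode(params)}"
        with urllib.request.urlopen(url, timeout=30) as response:
            tree = ET.parse(response)
        for record in tree.iter(OAI + "record"):
            header = record.find(OAI + "header")
            identifier = header.findtext(OAI + "identifier")
            title = record.findtext(f".//{DC}title")
            subjects = [s.text for s in record.iter(DC + "subject") if s.text]
            yield identifier, title, subjects
        # Large result sets are paged; follow the resumption token if present.
        token = tree.find(f".//{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

if __name__ == "__main__":
    for identifier, title, subjects in harvest_dc(BASE_URL):
        print(identifier, "|", title)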

Information discovery in a messy world

Fortunately, the power of modern computing, which makes large-scale digital libraries possible, also supports new capabilities for information discovery. Two themes are particularly significant; they are explored in the sections that follow.

Advances in information retrieval

When materials are in digital formats, it is possible for computer programs to extract information from the content. Automated full-text indexing is an approach to information retrieval that uses no metadata [11]. The actual words used by the author are taken as the descriptors of the content. The basic technique measures the similarity between the terms in each document and the terms in the query. Full-text search engines return documents ranked by how similar their terms are to those in the query. As early as 1967, Cleverdon recognized that, in some circumstances, automated indexes could be as effective as those generated by skilled human indexers [12]. This counter-intuitive result is possible because an automated index, containing every word in a textual document, has more information than a catalog or index record created by hand. It may lack the quality control and structure of fields that are found in a catalog record, but statistically the much greater volume of information provided by the author's words may be more useful than a shorter surrogate record.
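
The following Python sketch shows the basic technique in textbook form: terms are weighted by frequency and rarity (TF-IDF) and documents are ranked by the cosine similarity of their term vectors to the query. It is a deliberate simplification of the vector space model described by Salton and McGill [11], not a description of any production search engine, and the sample documents are invented.

# A textbook sketch of full-text retrieval in the vector space model: weight
# each term by TF-IDF, then rank documents by cosine similarity to the query.
# A simplification for illustration, not any production search engine.
import math
from collections import Counter

def tokenize(text):
    return [t for t in text.lower().split() if t.isalnum()]

def rank(documents, query):
    """Return (score, doc_id) pairs, best match first."""
    doc_terms = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    n_docs = len(documents)
    # Inverse document frequency: rare terms carry more weight.
    df = Counter(term for terms in doc_terms.values() for term in set(terms))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    def vector(counts):
        return {t: tf * idf.get(t, 0.0) for t, tf in counts.items()}

    q_vec = vector(Counter(tokenize(query)))
    results = []
    for doc_id, counts in doc_terms.items():
        d_vec = vector(counts)
        dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
        norm = (math.sqrt(sum(w * w for w in q_vec.values())) *
                math.sqrt(sum(w * w for w in d_vec.values())))
        results.append((dot / norm if norm else 0.0, doc_id))
    return sorted(results, reverse=True)

docs = {
    "d1": "volcano eruption data set for earth science classrooms",
    "d2": "sound recordings of north american birds",
    "d3": "kinematic models of the nineteenth century",
}
print(rank(docs, "volcano data"))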

By the early 1990s there were two well-established methods for indexing and searching textual materials: fielded searching of metadata records and full-text indexing. Both built on the implicit expectation that information resources were divided into relatively homogeneous categories of material; search systems were tuned separately for each category. Until the development of the Web, almost all information retrieval experiments studied homogeneous collections. For example, the classical Cranfield experiments studied papers in aeronautics [12]. When the Text REtrieval Conference (TREC) series in the 1990s carried out systematic studies of the performance of search engines, the test corpora came from homogeneous sources, such as the Associated Press newswire, thus encouraging the development of algorithms that perform well on homogeneous collections of documents [13].

Web search services combine a Web crawler with a full-text indexing system. For example, the first version of Lycos used the Pursuit search engine developed by Mauldin at Carnegie Mellon [14]. This was a conventional full-text system, which had done well in the TREC evaluations. There were two repeated complaints about these early systems: simple searches resulted in thousands of hits, and many of the highly ranked hits were junk. Numerous developments have enabled the search services to improve their results, even as the Web has grown spectacularly. Four developments, in particular, have general applicability; they are discussed in the sections that follow.

Understanding how and why users seek for information

The conventional measures of effectiveness, such as precision and recall, are based on a binary interpretation of relevance. A document is either relevant or not, and all relevant documents are considered equally important. With such criteria, the goal of a search system is to find all documents relevant to a query, even though a user’s information need may be satisfied by a single hit.
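
For reference, the two measures are conventionally defined (in the LaTeX notation below) in terms of the set of documents retrieved for a query and the set of documents judged relevant to it:

\[
\text{precision} = \frac{|\,\text{relevant} \cap \text{retrieved}\,|}{|\,\text{retrieved}\,|},
\qquad
\text{recall} = \frac{|\,\text{relevant} \cap \text{retrieved}\,|}{|\,\text{relevant}\,|}
\]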

In their original Google paper, Brin and Page introduced a new criterion [15]. They recognized that, with mixed content, some documents are likely to be much more useful than others. In a typical Web search the underlying term vector model finds every document that matches the terms in the query, often hundreds of thousands. However, the user looks at only the most highly ranked batches of hits, rarely more than a hundred in total. Google's focus is on those first batches of hits. The traditional objective of finding all relevant documents is not the goal.

With homogeneous content, all documents were assumed equally important; therefore they could be ranked by how similar they were to the query. With mixed content, many documents may be relevant, but not all of them are equally useful to the user. Brin and Page give the example of a web page that contains three words, "Bill Clinton sucks." This page is undoubtedly similar to the query "Bill Clinton". However, it is unlikely to be of much use. Therefore, Google estimates the importance of each page, using criteria that are totally independent of how well the page matches a query. The order in which pages are returned to the user is a combination of these two rankings: similarity to the query and importance of the document.

Relationship and context

Information resources always exist in a context and are often related to others. A monograph is one of a series; an article cites other articles; customers who buy a certain book often buy related ones; reviews describe how people judge resources. In formal catalogs and bibliographies, some relationships are made explicit; automated indexing of online content permits the inference of other relationships from context. Google's image search is an intriguing example of a system that relies entirely on context to search for images on the Web. The content of images cannot be indexed reliably and the only metadata for an image on a web page is the name of the file, but such images have considerable context. This context includes text in anchors that refer to the image, captions, terms in nearby paragraphs, etc. By indexing the terms in this contextual information, Google image search is often able to find useful images.
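
A toy version of context-only image search can make the idea concrete. In the Python sketch below, each image is indexed under the terms found in its surrounding text (anchor, caption, nearby paragraph) and queries are answered from that inverted index alone; the example data and data structures are invented for illustration, and the images themselves are never examined.

# A sketch of image search by context alone: index each image under the terms
# found in its anchor text, caption, and nearby paragraphs, then answer
# queries from that inverted index. Example data is invented; the content of
# the images is never examined.
from collections import defaultdict

def build_image_index(images):
    """images: list of (image_url, contextual_text) pairs."""
    index = defaultdict(set)
    for url, context in images:
        for term in context.lower().split():
            index[term.strip(".,;:")].add(url)
    return index

def search_images(index, query):
    """Return images whose context contains every query term."""
    results = None
    for term in query.lower().split():
        matches = index.get(term, set())
        results = matches if results is None else results & matches
    return results or set()

images = [
    ("birds/cardinal.jpg", "Northern cardinal photographed at a feeder"),
    ("maps/dc1851.jpg", "Map of the city of Washington, 1851"),
    ("music/cover042.jpg", "Illustrated cover of Civil War sheet music"),
]
index = build_image_index(images)
print(search_images(index, "sheet music"))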

On the Web, hyperlinks provide relationships between pages that are analogous to citations between papers [16]. Google's well-known PageRank algorithm estimates the importance of a web page by the number of other web pages that link to it, weighted by the importance of the linking pages and the number of links from each page. The Teoma search engine uses hyperlinks in a different way: after carrying out a text search, it analyzes several thousand of the highest-ranking results and identifies the pages that many of the others link to.
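
The link-based idea can be sketched as a short power iteration: a page's importance is the sum of the importance of the pages linking to it, each divided by that page's number of outgoing links, with a damping factor of 0.85 as in the original paper [15]. The tiny link graph below is invented for illustration, and real implementations add many refinements (for example, handling of pages with no outgoing links).

# A sketch of PageRank, after Brin and Page [15]: importance flows along
# links, divided by each page's number of outgoing links, and is computed by
# power iteration with damping factor d = 0.85. The link graph is invented,
# and pages without outgoing links are not handled specially.
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - d) / n for page in pages}
        for page, targets in links.items():
            if not targets:
                continue
            share = d * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

links = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home", "about"],
}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(f"{page:8s} {score:.3f}")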

Citations and hyperlinks are examples of contextual information embedded within documents. Reviews and annotations are examples of external information. Amazon.com has been a leader in encouraging the general public to provide such information. The value of information contributed by outsiders depends on the reputation of the contributor.

These techniques for exploiting context require powerful computation.

Multimodal information discovery

With mixed content and mixed metadata, the amount of information about the various resources varies greatly. Many useful features can be extracted from some documents but not all. For example, a <title> field in a web page provides useful information, but not all pages have <title> fields. Citations and hyperlinks are valuable when present, but not all documents have them. Such features can be considered clues. Multimodal information discovery methods combine information about various features of the collections, using all the information that is available about each item. The clues may be extracted from the content, may be in the form of metadata, or may be contextual.

The term "multimodal information discovery" was coined by Carnegie Mellon's Informedia project. Informedia has a homogeneous collection, segments of video from television news programs, but, because it is based on purely automated extraction from the content, topic-related metadata varies greatly [17]. The search and retrieval process combines clues derived automatically in many ways. The concept behind the multimodal approach is that " the integration of … technologies, all of which are imperfect and incomplete, would overcome the limitations of each, and improve the overall performance in the information retrieval task" [18].

Web search services also use a multimodal approach to ranking. While the technical details of each service are trade secrets, the underlying approaches combine conventional full-text indexing with contextual ranking, such as PageRank, using every clue that they can find, including anchor text, terms in titles, words that are emphasized or in larger font, and the proximity of terms to each other.
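
A toy multimodal ranker illustrates the principle of using whatever clues are present. In the Python sketch below the clue names, weights, and scores are invented and bear no relation to any real service's proprietary formula; the essential point is that missing clues are simply skipped and the weights renormalized, so that an item with sparse metadata is not automatically penalized to zero.

# A toy multimodal ranker: combine whatever clues are available for an item
# (text similarity, link-based importance, anchor text, title match) and skip
# clues that are missing. Clue names, weights, and scores are invented.
WEIGHTS = {
    "text_similarity": 0.5,   # match between query and full text
    "importance": 0.3,        # e.g. a link-based score such as PageRank
    "anchor_match": 0.1,      # query terms appear in anchor text pointing here
    "title_match": 0.1,       # query terms appear in the <title> field
}

def multimodal_score(clues):
    """clues maps clue name -> score in [0, 1]; absent clues are skipped."""
    available = {name: w for name, w in WEIGHTS.items() if name in clues}
    total = sum(available.values())
    if total == 0:
        return 0.0
    # Renormalize over the clues we actually have, so items with sparse
    # metadata are not automatically ranked last.
    return sum(w * clues[name] for name, w in available.items()) / total

items = {
    "web page": {"text_similarity": 0.7, "importance": 0.9, "title_match": 1.0},
    "data set": {"text_similarity": 0.8},                  # metadata only
    "photo":    {"anchor_match": 0.6, "importance": 0.4},  # no indexable text
}
for name, clues in sorted(items.items(), key=lambda kv: -multimodal_score(kv[1])):
    print(f"{name:10s} {multimodal_score(clues):.2f}")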

User interfaces for exploring results

Good user interfaces for exploring the results of a search can compensate for many weaknesses in the search service, including indifferent or missing metadata. It is no coincidence that Informedia, where the quality of metadata is inevitably poor, has been one of the key research projects in the development of user interfaces for browsing.

Perhaps the most profound change in information discovery in the past decade is that the full content of many resources is now online. Now that the time to retrieve a resource has gone from minutes, hours, or even days to a few seconds, browsing and searching are interwoven. The Web search services provide a supreme example. Weak by many of the traditional measures, they nevertheless provide quick and direct access to information sources that the user can then explore independently. A common pattern is for a user to type a few words into a Web search service, glance through the list of hits, examine a few, try a different combination of search terms, and examine a new set of hits. This rapid interplay between the user's expertise and the computing tools is totally outside the formal analysis of single searches that is still the basis of most information retrieval research.

The user interface to RLG's Cultural Materials resource provides a different example [19]. It consists of a simple search system, supported by elegant tools for exploring the results. The search system acts as a filter that reduces the number of records that the user is offered to explore. In the past, a similar system would almost certainly have provided a search interface with advanced features that a skilled user could use to specify a very precise search. Instead, RLG has chosen a simple search interface and an easily understood interface for exploring the results. Neither requires a skilled user. The objective is flexible exploration rather than precise retrieval.

Yet another area where Google has advanced the state of the art in information discovery is the short records that are returned for each hit, sometimes called "snippets". Each is a short extract from the web page that summarizes it so that the user can decide whether to view it. Most services generate the snippets when the pages are indexed, so that for a given page the user always receives the same snippet, whatever the query. Google generates snippets dynamically, to include the words on the page that were matched against the query.
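
One simple way to generate such query-dependent snippets is to slide a window over the text of the page and return the window that contains the most query terms. The Python sketch below illustrates that idea only; it is not a description of Google's algorithm, and the sample page text is abbreviated for the example.

# A sketch of dynamic snippet generation: instead of a fixed summary stored at
# indexing time, choose the window of the page's text that contains the most
# query terms and return it as the snippet. A simplified illustration only.
def dynamic_snippet(text, query, window=30):
    words = text.split()
    query_terms = {t.lower() for t in query.split()}
    best_start, best_hits = 0, -1
    for start in range(0, max(1, len(words) - window + 1)):
        hits = sum(1 for w in words[start:start + window]
                   if w.lower().strip(".,;:()") in query_terms)
        if hits > best_hits:
            best_start, best_hits = start, hits
    snippet = " ".join(words[best_start:best_start + window])
    return ("..." if best_start > 0 else "") + snippet + "..."

page = ("American Memory provides free access to written and spoken words, "
        "sound recordings, still and moving images, prints, maps, and sheet "
        "music that document the American experience.")
print(dynamic_snippet(page, "sheet music", window=12))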

Case study: the NSDL

The NSDL provides an example of many of the approaches discussed above. The library has been explicitly designed on the assumption of heterogeneity [20]. The centerpiece of the architecture is an NSDL repository, which is intended to hold everything that is known about every item of interest. As of 2003, the repository holds metadata in only a limited range of formats; considerable emphasis has been placed on Dublin Core records, both item-level and collection-level. The first search service combines fielded searching of these records with a full-text index of those textual documents that are openly accessible for indexing. Three improvements are planned for the near term: expansion of the range of metadata formats that are accepted, improved ranking, and dynamic generation of snippets. In the medium term, the major development will be the addition of contextual information, particularly annotations and relationships. For an educational digital library, recommendations based on practical experience in using the resources are extremely valuable. Finally, various experiments are under way to enhance the exploration of the collections; visualization tools are particularly promising.

The target audiences are so broad that several portals, or views into the same digital library, are planned. Moreover, the NSDL team hopes and expects that users will discover NSDL resources in many ways, not only by using the tools that the NSDL provides but also, for example, via Web search services. One example is a browser extension that enables users to see whether resources found in other ways are in the NSDL: when Google returns a page of results, the user clicks the tool and a hyperlinked logo is appended to each URL on the page that references an NSDL resource.

Implications for the future

In summary, as digital libraries grow larger, information discovery systems must increasingly take on the characteristics described above, designed from the outset for mixed content and mixed metadata.

Perhaps the most important conclusion is that successful information discovery depends on the inter-relationship between three areas: the underlying information (content and metadata), computing tools to exploit both the information and its context, and the human-computer interfaces that are provided. Most of this book is about the first of these three, the relationship between content and metadata, but none of them can be studied in isolation.

Acknowledgements

This paper synthesizes ideas that we have gained in working on American Memory and the NSDL and from many other colleagues. This work was supported in part by the National Science Foundation, under NSF grant 0127308.

References

[1] National Science Digital Library, http://nsdl.org/

[2] American Memory, http://memory.loc.gov/

[3] MARC 21 Formats, http://www.loc.gov/marc/marcdocz.html

[4] OpenURL, http://www.sfxit.com/openurl/openurl.html

[5] MODS, http://www.loc.gov/standards/mods/

[6] Z39.50, http://www.loc.gov/z3950/agency/

[7] The Bath Profile: An International Z39.50 Specification for Library Applications and Resource Discovery, Release 2.0, maintained by the Bath Profile Maintenance Agency, Library and Archives of Canada, March 2003.
http://www.nlc-bnc.ca/bath/tp-bath2-e.htm

[8] Stuart Weibel, Metadata: the foundations of resource description. D-Lib Magazine, vol. 1, no. 1, July 1995.
http://www.dlib.org/dlib/July95/07contents.html

[9] Open Archives Initiative Protocol for Metadata Harvesting,
http://www.openarchives.org/OAI/openarchivesprotocol.html

[10] William Y. Arms, Naomi Dushay, Dave Fulker, and Carl Lagoze, A Case Study in Metadata Harvesting: the NSDL. Library Hi Tech, vol. 21, no. 2, 2003.
http://www.cs.cornell.edu/wya/papers/LibHiTech-2003.doc

[11] Gerard Salton and Michael J. McGill, Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

[12] Cyril William Cleverdon, The Cranfield tests on index language devices, ASLIB Proceedings, vol. 19, no. 6, pp 173-194, June 1967.

[13] Ellen M. Voorhees and Donna Harman, Overview of the Eighth Text REtrieval Conference (TREC-8), 1999. http://trec.nist.gov/pubs/trec8/papers/overview_8.ps

[14] Michael L. Mauldin, Lycos: Design Choices in an Internet Search Service. IEEE Expert, vol. 12, no. 1, pp 8-11, 1997.

[15] Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine. Seventh International World Wide Web Conference. Brisbane, Australia, 1998.
http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm

[16] Eugene Garfield, Citation Indexing: Its Theory and Application in Science, Technology, and Humanities. Wiley, New York, 1979.

[17] Informedia, http://www.informedia.cs.cmu.edu/

[18] Howard Wactlar, Informedia - Search and Summarization in the Video Medium. Proceedings of Imagina 2000 Conference, Monaco, January 31 to February 2, 2000.
http://www.informedia.cs.cmu.edu/documents/imagina2000.pdf

[19] RLG Cultural Materials, http://cmi.rlg.org/

[20] William Y. Arms, et al., A Spectrum of Interoperability: The Site for Science Prototype for the NSDL. D-Lib Magazine, vol. 8, no. 1, January 2002.
http://www.dlib.org/dlib/january02/arms/01arms.html


Caroline R. Arms
William Y. Arms

December 9, 2003