We deliberately mention abuses, because the simple definition of metadata as "data about data" causes a number of problems. To be more precise, these problems arise not from the definition, but from unspoken assumptions about the data being described and the purposes the description is to serve. For example, the massive storage community considers metadata to be the information on how a terascale dataset has been sliced and diced for storage onto a set of tapes. Statisticians consider metadata to be information on how the samples for a set of experimental measurements were prepared, as well as information on known biases in experimental measurements that need to be corrected. Database designers consider metadata to be relational schemas and data dictionaries, while librarians consider it to be catalog records. Even within a community there will be a number of different purposes to be achieved, and forms of metadata developed to meet those purposes. As an example, librarians not only have catalog records for the items in their collection, they have controlled vocabularies for subject descriptions and author names, and special statements on the terms and conditions for accessing a work.
This diversity of datasets and purposes holds little hope of a universal taxonomy for metadata, short of the development of a theory of everything. Yet there are application areas, such as digital libraries, where implementers will need to be able to deal with all the varieties of metadata mentioned above, and no doubt many more. As one example, Los Alamos National Laboratory is developing a Scientific Data Management system that must deal with:
Other applications, such as a system of interoperable digital library repositories, also require a reasonably general approach to the metadata problem. We already see that documents are becoming more like assemblies of software components [Extreme]. This is best seen in modern web documents, where Java applets and JavaScript functions are used to interact with remote servers. The goal of this componentization is two-fold: to communicate better with the reader, and to allow the reuse of the components in many documents. A repository that can tailor documents to the capabilities of the reader's system, as well as control how components are re-used, will need a very flexible architecture if it is to remain viable for any length of time.
What is needed is a model for metadata that allows new forms to be dealt with in a fashion similar to existing forms, a model that fully exploits the unique combination of computation and connectivity that characterizes the digital library. In this paper, we describe an extension of the Warwick Framework that we call Distributed Active Relationships (DARs). DARs provide a powerful model for representing data and metadata in digital library objects. They explicitly express the relationships between networked resources. New relationships may be defined anywhere in the network, and may even be defined in such a way that they could be dynamically downloaded and executed in a manner analogous to a Java applet.
The DAR model is based on the following principles, which our examination of the "data about data" definition has led us to regard as axiomatic:
Coordinating metadata development across all those domains is impossible. Therefore the creation, administration, and enhancement of individual metadata forms should be left to the relevant communities of expertise. Ideally this would occur within a framework that supports interoperability across data and domains. The Warwick Framework (WF) provides just such a modular approach to metadata.
The Warwick Framework originated from an attempt, at the Second Invitational Metadata Workshop [WW], to define an extension mechanism for the Dublin Core Metadata Element Set [DC] in order to prevent unrestricted growth in its complexity. Named after the site of the workshop in Warwick, the WF tackles the extension problem by aggregating typed metadata packages into containers. The WF defines three types of package:

Figure 1 illustrates a simple example of a Warwick Framework Container. The container contains three logical packages of metadata. The first two, a Dublin Core record and a MARC record, are physically in the container. The third metadata package, which defines the terms and conditions for access to a content object, is referenced in the container indirectly via a URI.
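The container of Figure 1 can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names `Package` and `Container`, the package type strings, the placeholder record contents, and the URI `http://example.org/terms` are all hypothetical and are not part of the Warwick Framework itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Package:
    """One typed metadata package (hypothetical model, not a WF definition)."""
    package_type: str               # e.g. "dublin-core" or "marc"
    content: Optional[str] = None   # set when the package is physically contained
    uri: Optional[str] = None       # set when the package is referenced indirectly

    def is_indirect(self) -> bool:
        """True when the package lives outside the container, behind a URI."""
        return self.content is None and self.uri is not None


@dataclass
class Container:
    """Aggregates typed packages, in the spirit of Figure 1."""
    packages: List[Package] = field(default_factory=list)

    def add(self, package: Package) -> None:
        self.packages.append(package)

    def types(self) -> List[str]:
        return [p.package_type for p in self.packages]


# Two packages held in the container, one referenced via an assumed URI.
container = Container()
container.add(Package("dublin-core", content="<placeholder Dublin Core record>"))
container.add(Package("marc", content="<placeholder MARC record>"))
container.add(Package("terms-and-conditions", uri="http://example.org/terms"))
```

A receiver can walk `container.packages` and handle only the package types it understands, which is the component-level interoperation the framework aims for.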
The framework is a simple concept, but it has important implications
for interoperation, and as the basis for long-lived metadata systems. By
factoring complex descriptions into simpler components, interoperation
can be addressed at a component level, rather than at an "all or nothing"
monolithic level. The framework also allows for lowest-common-denominator
descriptions, such as the Dublin Core, to exist beside complex descriptions
from specialized communities, such as MARC. Thus, members of the same community
can exchange their rich descriptions in preference to more general ones.
System evolution is facilitated since, as new purposes for datasets emerge,
new metadata schemas and formats can be developed instead of trying to
evolve an already-established schema. Instances of the new schemas can
be added as new packages to the container(s) associated with the dataset.
New handlers can be added to utilize the new package, and this can occur
without significant disruption to the metadata system architecture as a
whole.
To meet this need, we defined a new abstraction called the Warwick Framework Catalog (WFC). A WFC is a list of assertions about individual packages and the relationships between packages. Example relations are one package acting as a digital signature, bibliographic description, or access control specification for another package.
(bibliographic-description package-1 package-2)
(terms-for-accessing package-1 package-3)
(derived-via-transformation package-1 package-6 package-5)
(digital-signature package-1 package-4)
(digital-signature package-6 package-7)
Listing 1 illustrates an example Warwick Framework Catalog. It shows
package-2 is a bibliographic description of package-1, while package-3
provides the terms for gaining access to package-1. Relations need not
be binary; for example, package-5 is derived from package-1 by a
transformation that is specified in package-6. The same relation might
hold between different sets of resources, as shown by the digital-signature
relation in the last two lines. Figure 2 shows a simple Warwick Framework
Container with a relationship package.
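Because the catalog is just a list of assertions, a receiver can read it with very little machinery. The following is a minimal sketch, assuming the flat s-expression syntax of Listing 1 with no nested forms; the function names `parse_catalog` and `assertions_about` are hypothetical.

```python
import re
from typing import List, Tuple

Assertion = Tuple[str, ...]

# The assertions of Listing 1, one relation per parenthesized form.
LISTING_1 = """
(bibliographic-description package-1 package-2)
(terms-for-accessing package-1 package-3)
(derived-via-transformation package-1 package-6 package-5)
(digital-signature package-1 package-4)
(digital-signature package-6 package-7)
"""


def parse_catalog(text: str) -> List[Assertion]:
    """Turn each flat form into a tuple (relation, arg1, arg2, ...)."""
    return [tuple(form.split()) for form in re.findall(r"\(([^()]*)\)", text)]


def assertions_about(catalog: List[Assertion], package: str) -> List[Assertion]:
    """Return every assertion whose first argument is the given package."""
    return [a for a in catalog if len(a) > 1 and a[1] == package]


catalog = parse_catalog(LISTING_1)
for assertion in assertions_about(catalog, "package-1"):
    print(assertion)
```

Representing each assertion as a tuple of arbitrary length keeps non-binary relations, such as derived-via-transformation, on the same footing as binary ones.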

The WFC could be provided as the first package in a container, and would
provide enough information to the receiver to allow proper treatment of
the remaining packages. Although the example above uses an s-expression
syntax, the WFC is essentially another conceptual model that can be expressed
in a number of ways. The key contribution of the WFC is that it leads to
some far-reaching generalizations to the Warwick Framework. Those generalizations
are described in the next two sections.
A better approach is to consider the information architecture as a collection of inter-related resources. While these resources may have a type, such as PostScript, HTML, or a Java program, this type is orthogonal to whether the resource is acting as data vs. metadata in some context. That contextual information is specified by the relationships between the resources. We can model these inter-related resources using directed graphs, where nodes represent the resources and the labeled arrows between nodes represent the relationships. Since a resource may be related to many other resources, nodes may have many arcs originating from or terminating at them. Looking at the direction of an arrow, it is easy to see whether a resource is playing the role of data or metadata in the context of that particular relationship. We can easily accommodate such a model by generalizing the Warwick Framework so that it may contain any resources, not just those considered "metadata". Thus, we can use the Warwick Framework Catalog to specify the relationships between various resources, both inside and outside the container.
As a simple example (we will use more complex examples later), assume that the relationship arcs are uni-directional and that the only relationship they specify is "has-metadata". Figure 3 shows a set of resource nodes and relationship arcs that correspond to the Siskel and Ebert movie review mentioned earlier. For the moment, ignore the three overlapping ovals in the figure. As illustrated, certain resource nodes have both outgoing and incoming arcs; thus they are "data" in one context and "metadata" in another. For example, the Siskel and Ebert review is metadata for the movie "Men in Black", but the review has metadata of its own (it is acting as "data" relative to a Dublin Core record and a Terms and Conditions specification).
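The idea that a resource's role depends on arc direction can be sketched with a small directed graph. The node identifiers below are hypothetical labels for the resources named in the text, and the arc list covers only those resources, not necessarily everything drawn in Figure 3.

```python
from collections import defaultdict

# Each arc (source, target) reads "source has-metadata target".
arcs = [
    ("movie:men-in-black", "review:siskel-ebert"),
    ("review:siskel-ebert", "dublin-core:review"),
    ("review:siskel-ebert", "terms:review"),
]

outgoing = defaultdict(list)  # node -> resources that describe it
incoming = defaultdict(list)  # node -> resources that it describes
for source, target in arcs:
    outgoing[source].append(target)
    incoming[target].append(source)


def roles(node: str) -> set:
    """A node is 'data' if arcs leave it and 'metadata' if arcs arrive at it."""
    result = set()
    if outgoing.get(node):
        result.add("data")
    if incoming.get(node):
        result.add("metadata")
    return result


for node in ("movie:men-in-black", "review:siskel-ebert", "terms:review"):
    print(node, sorted(roles(node)))
```

The review node ends up with both roles, which is the point of the example: "data" and "metadata" are properties of a relationship, not of a resource.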

We can take a different perspective on Figure 3 and formulate three digital library resources, which can be found through resource discovery and accessed using unique identifiers (such as URLs and URNs). Each of these resources aggregates data and related metadata. These aggregations, shown by the overlapping ovals, are:

In generalizing the Warwick Framework as a digital object container, we emphasize two features and then introduce a significant extension.
First, recall that the Warwick Framework places no locality restriction
on the packages that it "contains". A package may either be physically
in a container or indirectly referenced via a URI (thus, it might be located
anywhere in the global information space). This is demonstrated in Listing
2, in which the relationships in a Warwick Framework Catalog refer to resources
using URIs as well as internal package references. Figure 5 illustrates
a digital object container that references, through the relationship catalog,
a component of an external digital object. One interesting manifestation
of this is that a container, or digital object, may actually
have no physically contained data sets, but may act merely as a logical
container with only relationships that reference remote data sets.
Second, the example in Figure 3 illustrates only one simple type of uni-directional relation, the "has-metadata" relation. However, as we have emphasized throughout our work on the Warwick Framework, the notion that something "is metadata" does little to convey its actual meaning and, therefore, such a simple relationship should be avoided. The Warwick Framework Catalog can include a variety of relationships with much richer semantics, such as "terms-for-accessing", "bibliographic-description", and the like.
(bibliographic-description package-1 URI-1)
(terms-for-accessing package-1 URI-2)
Up to this point, we have assumed that the relationships in the Warwick
Framework catalog are identified with simple names, which might be listed
in some registry. A more general solution is to let the relationship names
be URIs. This provides a scoping mechanism to preclude name clashes. More
interestingly, it opens up the possibility of making the relations into
resolvable first-class resources in their own right. These "relation resources"
might have their own "metadata", including access controls and descriptions.
In this scenario, the simple relationship arcs illustrated in Figure 3
become nodes in their own right, with possible
relationships to other data nodes. In the next section, we extend this
notion even further by describing executable relationships that enable
dynamic and interpretable data and metadata.

The best way to describe the motivation and use of DARs is to apply them to a well-known problem, rights management. Managing intellectual property rights for digital library objects is complex, and we refer the reader to [GLAD] for a more thorough treatment of the subject. At one end of the spectrum, rights management metadata may be a simple textual description, such as that used in "shrink-wrap" licenses. At the other end, there are complex access control schemes that may involve interaction and negotiation with authentication services, billing services, agents, etc. Any reasonable architecture for networked information management must accommodate the full set of rights management possibilities.
One approach to this problem is executable rights management metadata. The metadata returned to a client could be an executable object, or a handle to an executable object using distributed object technology such as CORBA. Using this executable metadata, the client may present, obtain, or negotiate the proper certificates or authorization to access the content of the digital object. During this process, the executable metadata may contact other services that are necessary to obtain the certification or authorization.
Figure 6 illustrates a Distributed Active Relationship that manages the access rights to a resource. In this case, the rights management scheme is based on the notion of an access control list. Note the separation between the access control list in the package labeled P2 and the mechanism for the enforcement, which is in the external relation object. Also note that the relation object is a digital object in its own right, referenced via a global identifier, URN1 in the relationship catalog. The activation package in the figure stands for an executable component of the relation that would be invoked when a client accesses the content in the package labeled P1. The description package in the relationship container might be some textual description of the relationship. Section 6.1 describes one possible implementation of such a rights management mechanism.
An important aspect of this rights management scheme, and of the
DAR concept in general, is that the executable part of the DAR is external
to the resource being accessed and to the repository containing the resource.
This level of modularization maximizes code reuse and extensibility: not
all contingencies and consequences need be anticipated before
an object is released. Rather, a rights holder may add to or subtract from
the metadata as circumstances change and new services become available.
Section 6.1 describes a digital library repository architecture that implements
this scheme for rights management.
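The separation between the access control list and its enforcement can be sketched as follows. This is a toy model under stated assumptions, not the mechanism of Section 6.1: the dictionary `RELATION_REGISTRY` stands in for resolving a global identifier such as URN1 to an executable relation object, and the function names, user names, and container layout are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for resolving a global identifier to the
# activation component of an external relation object.
RELATION_REGISTRY: Dict[str, Callable[[List[str], str], bool]] = {}


def acl_enforcer(acl: List[str], user: str) -> bool:
    """Activation component: grant access only to users named in the ACL."""
    return user in acl


RELATION_REGISTRY["URN1"] = acl_enforcer

container = {
    "packages": {
        "P1": "protected content",
        "P2": ["alice", "bob"],  # the access control list (illustrative names)
    },
    # Each entry: (relation identifier, protected package, policy package).
    "catalog": [("URN1", "P1", "P2")],
}


def access(container: dict, package: str, user: str) -> str:
    """Run every relation that guards the package before releasing it."""
    for relation, target, policy in container["catalog"]:
        if target == package:
            enforce = RELATION_REGISTRY[relation]
            if not enforce(container["packages"][policy], user):
                raise PermissionError(f"{user} may not access {package}")
    return container["packages"][package]


print(access(container, "P1", "alice"))
```

Because the container holds only the list and a reference to the relation, a rights holder could swap in a different enforcement mechanism by changing the catalog entry, without touching P1 or P2.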

Another consequence of the DAR model is that metadata packages can be virtual or dynamic [LAG]. That is, the package data may only exist as the result of a computation on some other resource. For example, we might state that both MARC and Dublin Core descriptions of a resource are available. The Dublin Core description could be computed on demand from the MARC description. Active relationships can capture the dependency of the virtual Dublin Core package on the MARC package. This is similar in concept to, and could be applied to, the notion of "Just-in-time Conversion" addressed in [PW]. For this purpose, a single underlying format, such as a scanned image, could be associated with several different DARs that convert the object on demand to a variety of formats such as JPEG, GIF, or OCR-ed text.
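A virtual package can be sketched as a stored transformation plus a reference to its source. The example below is deliberately simplified: the MARC record is modeled as a flat dictionary keyed by field tag, the three-entry mapping is an illustrative assumption rather than a complete crosswalk, and the names `VirtualPackage` and `marc_to_dublin_core` are hypothetical.

```python
from typing import Callable, Dict

Record = Dict[str, str]

# Illustrative mapping from MARC field tags to Dublin Core elements.
MARC_TO_DC = {"245": "Title", "100": "Creator", "260": "Publisher"}


def marc_to_dublin_core(marc: Record) -> Record:
    """Keep only the fields that have a Dublin Core counterpart in the mapping."""
    return {MARC_TO_DC[tag]: value for tag, value in marc.items() if tag in MARC_TO_DC}


class VirtualPackage:
    """A package whose data exists only as a computation on another resource."""

    def __init__(self, source: Record, transform: Callable[[Record], Record]) -> None:
        self._source = source
        self._transform = transform

    def materialize(self) -> Record:
        # Computed on every request, so it always reflects the current source.
        return self._transform(self._source)


marc_record = {"245": "An Example Title", "100": "Example, Author", "008": "fixed-length data"}
dc_package = VirtualPackage(marc_record, marc_to_dublin_core)
print(dc_package.materialize())
```

Since nothing is cached, a later correction to the MARC record shows up in the Dublin Core view automatically; a real system would have to weigh that against the cost of recomputation.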
While the DAR model is intriguing, there are three problem areas that
must be addressed in practical implementations.
The FEDORA architecture is designed to enable interoperability by three means: (1) supporting the aggregation of heterogeneous, distributed content, (2) providing a means for attaching extensible behaviors to a digital object, and (3) providing a mechanism for associating externally-supplied rights enforcement mechanisms with the digital object to protect intellectual content.
Essentially, FEDORA Digital Objects are designed to avoid functional obsolescence through this distinction between the internal form of digital content and the disseminated form. While the raw content of an object will persist over time, the behaviors of the object (and the requests users can make upon the object) will change. So, not only can Digital Objects evolve by incorporating new content forms, but they can also exhibit new behaviors through the ability to "plug in" new Interfaces. For instance, today an object may assert its ability to produce a Dublin Core record and a Postscript version of its content. Later, through the unplugging of old or obsolete interfaces, and the addition of newly developed interfaces, the same base object may have a different set of behaviors. For example, when accessed, it may announce that it can now disseminate a Dublin Core record, an XML-wrapped document, and a newly developed high-compression image format of the content.
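The plugging and unplugging of Interfaces can be sketched with a small class. This is not the FEDORA API: the class `DigitalObject`, its methods, and the method names `getRecord` and `getDocument` are hypothetical, chosen only to mirror the behavior described above.

```python
from typing import Callable, Dict, List

Datastreams = Dict[str, str]
Method = Callable[[Datastreams], str]


class DigitalObject:
    """Hypothetical stand-in for a digital object with pluggable interfaces."""

    def __init__(self, datastreams: Datastreams) -> None:
        self.datastreams = dict(datastreams)
        self._interfaces: Dict[str, Dict[str, Method]] = {}

    def plug_in(self, name: str, methods: Dict[str, Method]) -> None:
        self._interfaces[name] = dict(methods)

    def unplug(self, name: str) -> None:
        self._interfaces.pop(name, None)

    def behaviors(self) -> List[str]:
        """List what the object can currently disseminate, as 'Interface.method'."""
        return sorted(
            f"{name}.{method}"
            for name, methods in self._interfaces.items()
            for method in methods
        )

    def disseminate(self, interface: str, method: str) -> str:
        return self._interfaces[interface][method](self.datastreams)


obj = DigitalObject({"DS4": "raw content"})
obj.plug_in("DublinCore", {"getRecord": lambda ds: "DC record for " + ds["DS4"]})
obj.plug_in("Postscript", {"getContent": lambda ds: "Postscript of " + ds["DS4"]})
today = obj.behaviors()

# Later: retire the Postscript interface and add an XML one.
obj.unplug("Postscript")
obj.plug_in("XML", {"getDocument": lambda ds: "<doc>" + ds["DS4"] + "</doc>"})
later = obj.behaviors()
```

The datastream DS4 is never modified; only the set of views the object can produce changes between `today` and `later`.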
As previously mentioned, Interfaces are one way in which FEDORA uses the abstraction of DARs. An Interface embodies all the requisite relationship information expressed in the semantics of a DAR. It endows a Digital Object with the ability to disseminate content by specifying which Datastreams are related, the nature of the relationships, and the operational semantics of the relationships. In Figure 7, for example, there is a Distributed Active Relationship that returns a MARC record describing the content in DS4, and one that returns a Dublin Core record that is computed on-the-fly from the MARC record. Any number of Interfaces can be linked to a Digital Object, enabling it to perform specialized operations. Without knowing any of the structural details of a digital object, a client could discover that the object will produce a number of views of itself, such as: a watermarked image of an identified graphic; a particular page of document content in Postscript format; or a visualization of a dataset.
To maximize interoperability and long-term viability of digital objects,
FEDORA provides a modular, extensible rights management architecture that
is not dependent on any particular security scheme. Just as behaviors can
be added or removed from Digital Objects over time, the means of securing
these behaviors can change and adapt as security applications mature. Enforcers
are first-class objects that are stored persistently with each Interface
linked to a Digital Object. As such, they can take advantage of enforcement
mechanisms that live outside of FEDORA; Enforcers serve as the means
of connecting FEDORA Digital Objects to any number of external rights management
services.

Figure 7 shows an example of a simple Enforcer that secures the Postscript behaviors of the depicted Digital Object. The relevant DataStream (DS4) is protected by an Enforcer that is wrapped around the Postscript Interface. Again, the Enforcer is directly securing the behaviors of the object (e.g., getPage, getContent), thus indirectly protecting the content. Also, it should be noted that the Enforcer implements the same rights management DAR described earlier (see Figure 6). It links an Access Control List (DS1) with an external enforcement engine (remote Datastream pointing to URN1) and protects the intellectual content (DS4) by running the Enforcer every time the Postscript Interface methods are invoked. While the Access Control List is stored in the Digital Object as a distinct content package, the mechanism for executing the Enforcer can exist outside the Digital Object, and optionally, outside of the FEDORA repository.
FEDORA is currently being implemented and will be tested in the context of a reference implementation that includes other key services (e.g., searching and name resolution) of an interoperable, distributed digital library architecture.
RDF has four components: the modeling facility, the serialization syntax, schema definitions, and rule definitions. Currently, a public draft for the modeling facility and syntax has been released [RDF], and the schema working group has begun its deliberations. The model and syntax draft will be revised in the near future to add a typing mechanism similar to that of modern object-oriented programming languages, once the interactions between typing and schemas have been specified.
Similar to the approach discussed in section 4, RDF models are directed
graphs. Nodes represent web resources, arcs state that certain properties
(such as "Author") are associated with a node, and arcs terminate either
at a node or at a string. As an example, Figure 8 shows a model for some
simple Dublin Core bibliographic information associated with a web page.

Listing 3 shows the serialized version of that model.
<?namespace href="http://www.purl.org/Metadata/DublinCore/" as="DC"?>
<?namespace href="http://www.w3.org/Schemas/RDF/" as="RDF"?>
<RDF:Serialization>
  <RDF:Assertions href="http://www.acl.lanl.gov/~rdaniel/">
    <DC:Creator>Ron Daniel Jr.</DC:Creator>
    <DC:Publisher>Los Alamos National Laboratory</DC:Publisher>
  </RDF:Assertions>
</RDF:Serialization>
One of the key features of RDF is its pervasive use of URIs. The namespace declarations in Listing 3 provide one indication of this. Tag names like DC:Creator expand to a 2-tuple composed of a URI, such as http://www.purl.org/Metadata/DublinCore, and the identifier "Creator". This gives us scoped names, preventing confusion between differing definitions of terms like "Title". (Legal title is not the same as royal title, which in turn is different from the title of a book.) Using URIs for the terms in a namespace also allows namespace definitions to be fetched from the network.
In order to implement DARs in RDF we extend the namespace definition slightly by allowing scoped tag names to expand to a URI such as http://www.purl.org/Metadata/DublinCore/Creator. (The XML namespace mechanism is only now being specified [XML2] and neither blesses nor precludes this extension.) With this extension, the arcs in RDF correspond to DARs. For example, the DC:Creator arc in Figure 8 can be expressed as a DAR through the 3-tuple scheme shown in Listing 4.
(http://www.purl.org/Metadata/DublinCore/Creator    - the arc type
 http://www.acl.lanl.gov/~rdaniel/                  - the source of the arc
 "Ron Daniel Jr.")                                  - the dest. of the arc
Thus, RDF seems to provide the facilities needed to construct an active metadata system.
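The expansion of a scoped tag name into a single URI, and from there into the 3-tuple of Listing 4, can be sketched in a few lines. The prefix table reuses the namespace declarations of Listing 3; the function names `expand` and `arc_to_dar` are hypothetical, and simple concatenation is an assumption about how the extended expansion would work.

```python
from typing import Dict, Tuple

# Prefix-to-URI table taken from the namespace declarations in Listing 3.
NAMESPACES: Dict[str, str] = {
    "DC": "http://www.purl.org/Metadata/DublinCore/",
    "RDF": "http://www.w3.org/Schemas/RDF/",
}


def expand(tag: str) -> str:
    """Expand a scoped tag such as 'DC:Creator' into a single URI."""
    prefix, local_name = tag.split(":", 1)
    return NAMESPACES[prefix] + local_name


def arc_to_dar(tag: str, source: str, dest: str) -> Tuple[str, str, str]:
    """Express an RDF arc as (arc type, source of the arc, destination of the arc)."""
    return (expand(tag), source, dest)


dar = arc_to_dar("DC:Creator", "http://www.acl.lanl.gov/~rdaniel/", "Ron Daniel Jr.")
print(dar)
```

Once the arc type is itself a URI, it can in principle be resolved like any other resource, which is what allows a relation to carry its own description or executable component.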
Los Alamos National Laboratory is currently prototyping the use of RDF and DARs for a large scientific data management system. One of the issues being considered at this time is how to handle executable relations efficiently. Assume we have a repository similar to that of FEDORA, and that we wish to implement enforcers and interfaces. We can pick a particular form of executable content (such as Java class files) to support in our system. Determining the meaning of an executable relationship and deciding whether to run it remains a problem. As mentioned earlier, blindly executing all relationships would be foolish due to performance and security considerations. We can use RDF's typing system to indicate that particular relations are subclasses of known relationships such as "Enforcer" or "Interface". The security manager of our repository could then inspect the type of each DAR and execute only those that are subclasses of known, pre-approved types, which gives us some indication of their meaning.
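The type check performed by such a security manager can be sketched as a walk up a subclass chain. This is an illustration only: the table `SUBCLASS_OF` stands in for type information that would come from RDF, and every type name other than "Enforcer" and "Interface" is hypothetical.

```python
from typing import Dict, Optional, Set

# Hypothetical type table: each relation type maps to its parent type,
# or to None when it has no declared parent.
SUBCLASS_OF: Dict[str, Optional[str]] = {
    "Enforcer": None,
    "Interface": None,
    "AclEnforcer": "Enforcer",
    "PostscriptInterface": "Interface",
    "UnknownRelation": None,
}

APPROVED: Set[str] = {"Enforcer", "Interface"}


def is_approved(relation_type: str) -> bool:
    """True if the type, or any ancestor of it, is a pre-approved type."""
    seen: Set[str] = set()
    current: Optional[str] = relation_type
    while current is not None and current not in seen:
        if current in APPROVED:
            return True
        seen.add(current)  # guards against cycles in a malformed table
        current = SUBCLASS_OF.get(current)
    return False


for name in ("AclEnforcer", "PostscriptInterface", "UnknownRelation"):
    print(name, "execute" if is_approved(name) else "skip")
```

A relation whose type is missing from the table is treated the same as an unapproved one, so the default is not to execute.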
This foundation has proven very useful in the design of FEDORA, where it allowed a graceful and promising integration of such divergent notions as the Kahn/Wilensky Digital Library architecture and downloadable code (e.g. Java applets). We are particularly interested in the capabilities of the new Resource Description Framework to facilitate the construction of systems based on DARs. If it proves successful in prototypes, it could have an enormous impact on the design of metadata systems.
[Extreme] Computing and Communications in the Extreme: Research for Crisis Management and Other Applications; National Academy Press; Washington, D.C., 1996.
[WW] Metadata Workshop II, http://www.oclc.org:5046/oclc/research/conferences/metadata2/
[DC] Dublin Core Metadata Element Set Resource Page, http://purl.oclc.org/metadata/dublin_core/
[ARMS] Arms, William Y., "Key Concepts in the Architecture of a Digital Library", D-Lib Magazine, July 1995, http://www.dlib.org/dlib/July95/07arms.html
[GLAD] Gladney, H.M. and J.B. Lotspiech, "Safeguarding Digital Library Contents and Users: Assuring Convenient Security and Data Quality", D-Lib Magazine, May 1997, http://www.dlib.org/dlib/may97/ibm/05gladney.html
[LAG] Lagoze, Carl, "From Static to Dynamic Surrogates: DataStream Discovery in the Digital Age", D-Lib Magazine, June 1997, http://www.dlib.org/dlib/june97/06lagoze.html
[PW] Price-Wilkin, John, "Just-in-time Conversion, Just-in-case Collections: Effectively leveraging rich document formats for the WWW", D-Lib Magazine, May 1997, http://www.dlib.org/dlib/may97/michigan/05pricewilkin.html
[DL] Daniel Jr., Ron and Carl Lagoze, "Distributed Active Relationships in the Warwick Framework", Proceedings of the 1997 IEEE Metadata Conference, September, 1997, http://computer.org/conferen/proceed/meta97/papers/rdaniel/rdaniel.pdf
[KWF] Kahn, Robert and Robert Wilensky, "A Framework for Distributed Digital Object Services", Corporation for National Research Initiatives, http://www.cnri.reston.va.us/cstr/arch/k-w.html
[W3R] Press Release, W3C announces RDF, http://www.w3.org/Press/RDF
[XML] Extensible Markup Language (XML), World Wide Web Consortium, http://www.w3.org/XML/
[RDF] Lassila, Ora and Ralph R. Swick, "Resource Description Framework (RDF) Model and Syntax", World Wide Web Consortium, http://www.w3.org/TR/WD-rdf-syntax/
[XML2] Bray, Tim and Dave Hollander and Andrew Layman (eds.), "Name Spaces in XML", W3C XML Working Group White Paper 15-October-1997, http://www.textuality.com/xml/xml-names.html