*** Draft *** Report of the |
2. Identification

The session began with a presentation by Larry Lannom (CNRI) describing the Handle System [Handle], followed by a discussion of issues surrounding identification. A summary of the presentation is given first, then a summary of the discussion.

2.1 The Handle System

The Handle System provides a resolution service for the objects in a repository or collection. An object is given a name (a handle) which resolves to a location and any other current state data needed to obtain and use the object. When an object moves, the name is retained and the value(s) for the location are changed. Handles are part of CNRI’s Digital Object Architecture.

The CNRI Handle System is a reference implementation of the Handle Specifications (IETF RFC 3650-3652), optimized for speed and reliability. Other implementations could be built from the specifications.

The CNRI Handle System is a collection of Local Handle Servers with one Global Handle Registry (GHR) serving as the root for the system. Handle Services can have multiple sites, mirroring each other for redundancy. Each site has a collection of servers with the handles hashed across the servers. The Global Handle Registry is currently replicated across three sites, to reduce the risk of a single point of failure.

A handle has two components: a prefix and a suffix. The client used to resolve handles, which needs to know only about the GHR, takes the prefix and attaches it to the prefix for the global handle service. The GHR returns information about the sites for the handle service that provides resolution for the suffix. The client uses this information to select a server to contact to resolve the suffix for the original handle, returning the location and other data associated with the handle.

Handle ownership and administration are at the handle level, not at the server level. Having the administration key is sufficient to administer the handle; no server privileges are required.
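The two-step resolution described above can be sketched in a few lines of Python. This is an illustration only, not the CNRI implementation: the prefix `1234`, the server names, the `example.org` URLs, the in-memory tables and the SHA-256 based server selection are all assumptions made for the example. A real client speaks the Handle protocol defined in the RFCs.

```python
import hashlib

# Hypothetical stand-in for the Global Handle Registry:
# prefix -> mirrored sites, each site being a list of servers.
GLOBAL_REGISTRY = {
    "1234": [
        ["primary-1", "primary-2", "primary-3"],
        ["mirror-1", "mirror-2"],
    ],
}

# Hypothetical stand-in for the values held by the local handle service.
HANDLE_VALUES = {
    "1234/report-42": {"URL": "http://repository.example.org/objects/42"},
}


def split_handle(handle):
    """Split 'prefix/suffix' at the first slash."""
    prefix, sep, suffix = handle.partition("/")
    if not (sep and prefix and suffix):
        raise ValueError(f"not a prefix/suffix handle: {handle!r}")
    return prefix, suffix


def pick_server(handle, servers):
    """Choose a server within a site by hashing the handle (assumed scheme)."""
    digest = hashlib.sha256(handle.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]


def resolve(handle, site_index=0):
    """Step 1: ask the registry about the prefix. Step 2: ask a local server."""
    prefix, _suffix = split_handle(handle)
    sites = GLOBAL_REGISTRY[prefix]
    servers = sites[site_index % len(sites)]
    server = pick_server(handle, servers)
    return server, HANDLE_VALUES[handle]


def move_object(handle, new_url):
    """The object moves: the handle is retained, only its value changes."""
    HANDLE_VALUES[handle]["URL"] = new_url
```

Calling `resolve("1234/report-42")` returns the chosen server together with the stored values. After `move_object("1234/report-42", ...)` the same handle resolves to the new location, which is the indirection the Handle System relies on.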
Handle resolution and administration can be embedded in other processes, and handles can be embedded in URLs, but handle is not a URN scheme. There is no advantage to using URNs, as you still need the protocol as well as the indirection process, and URN doesn’t provide the protocol. The key for effective use is to have clients embedded widely in web browsers.

The largest handle prefix is the DOI, with 12,000,000 handles. There are over 1000 other prefixes representing an unknown number of handles. The CNRI Handle System processes 10,000,000 resolutions per month.

The learning object community is one of two communities newly interested in the Handle System, the other being the GRID community. The GRID community is committed to an open source license, which would mark a major shift from current policy, which emphasizes research and education and has the Digital Object Identifier as the only commercial implementation. While it is not clear what the outcome will be, the shift to open source within the context of the GRID could bring in additional commercial interests. Combined with the learning object community, the prospects are for considerably increased usage beyond the scholarly publishing and digital library base now using the system. This raises issues of governance, which are being addressed through the proposed evolution of the Handle System Advisory Committee into a body that manages the system long term.

[Section 2.1 is based on notes by Larry Lannom.]

2.2 Identifier Issues

The discussion began with a list of potential issues.
Key discussion points from the session follow.

What is an identifier?

A simple definition of an identifier is that it is a name that represents an object. This definition does not associate any behaviors with the name. A key behavior, however, is that identifiers are used to denote when two things are the same. When and why do you care when two things are the same? The answer to why it is important to know that two things are the same depends on the context in which the identifiers are used.

While identifiers are often just simple names for objects, it might be possible to embed more semantics in the identifier, e.g., to include versioning or metadata. Smart identifiers can help express relationships between objects, include date/time stamps, etc. But are these expressions the role of: identifiers; metadata associated with identifiers; or metadata services around the identifiers and their metadata? The benefits and drawbacks of smart identifiers have not been explored.

It may not be helpful to talk about global uniqueness. We need to put bounds on uniqueness, e.g., “unique in the domain of: x”.

What is identified?

What does the identifier point to: metadata for the object, a collection or aggregation, or the object directly? An abstract model is important in determining what is being identified. For example, in one model the New York Times would have only one identifier. This might not be the most appropriate model, as the paper has different content each day and different types of content over time. Much depends on the level of abstraction that you are dealing with. A different model for the New York Times might require four levels of identifier: the Times itself, an issue, an article or a page.

There are different abstract models for what is identified. In LOM, there are a number of assertions about an object glued together by an identifier. An alternate model would be to embed the identifier and metadata in the object.
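Two of the points above, bounded uniqueness and levels of abstraction, can be made concrete with a small Python sketch. The `ScopedId` class, the `example-news` domain and the path-style names are hypothetical and chosen only for illustration; nothing here was proposed in the session.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScopedId:
    """An identifier that claims uniqueness only within a stated domain."""
    domain: str  # "unique in the domain of: x"
    name: str

    def child(self, part: str) -> "ScopedId":
        """Identify something one level of abstraction further down."""
        return ScopedId(self.domain, f"{self.name}/{part}")


def same_thing(a: ScopedId, b: ScopedId) -> Optional[bool]:
    """Names alone decide sameness only inside one domain.

    Returns None when the domains differ, because the names by
    themselves cannot settle the question in that case.
    """
    if a.domain != b.domain:
        return None
    return a.name == b.name


# The four levels from the New York Times example (all names invented).
times = ScopedId("example-news", "nyt")
issue = times.child("2004-06-11")
article = issue.child("article-7")
page = issue.child("page-A1")
```

Note that the path-style names put semantics into the identifier itself, which is exactly the kind of smart identifier whose benefits and drawbacks were left unexplored. The same levels could instead be recorded as metadata associated with opaque names.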
The library world takes a different approach, e.g., the FRBR (Functional Requirements for Bibliographic Records) model. In the learning object world there is no clear model for versioning of content and association of identifiers with versioned content. Sometimes the identifier will be required to point to a specific version of a resource and sometimes it will point to the latest version of a resource.

What is a Persistent Identifier?

A persistent identifier is a globally unique actionable identifier that points to an object for all time. A persistent identifier tells you where and when the object is, i.e., permanence and persistence over space/time. The issue of persistence is critical to decisions made regarding the development and implementation of services which rely on persistence or are built around metadata associated with identifiers.

The discussion around persistence raised a number of questions and discussion points. How can any agency say that an identifier is persistent for all time? There are different perceptions of time by different communities, e.g., governments, publishers and archivists. There are different requirements for persistence for content and content management. How do owners make assertions regarding how long there is a guarantee of persistence or how long you can expect a resource to persist?

Thus we need to differentiate between: persistence of the identifier, persistence of content and persistence of the resolution service. The issues associated with each are different.

It may not be helpful to talk about persistence for all time, as it is not possible to engineer a solution for all time. Better options are to set bounds on the time period (e.g., 10 years) or to talk about persistence over the lifecycle of the object. Once the length of the persistence is decided, the other issues can be resolved. A question left to be answered is: which resources require persistent identifiers?
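One way to picture a bounded persistence assertion is as three separate commitments, one each for the identifier, the content and the resolution service. The Python sketch below is only an illustration under that assumption; the `PersistenceStatement` class, its field names and the dates are invented for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PersistenceStatement:
    """An owner's bounded assertion, with one end date per kind of persistence."""
    identifier_until: date   # the name will not be reassigned before this date
    content_until: date      # the content will be kept until this date
    resolution_until: date   # the resolution service will run until this date

    def status_on(self, day: date) -> dict:
        """Report each commitment separately, since the issues differ."""
        return {
            "identifier": day <= self.identifier_until,
            "content": day <= self.content_until,
            "resolution": day <= self.resolution_until,
        }

    def actionable_on(self, day: date) -> bool:
        """The identifier leads to the object only while all three hold."""
        return all(self.status_on(day).values())


# A ten-year bound on the identifier, with shorter bounds on the rest.
statement = PersistenceStatement(
    identifier_until=date(2014, 6, 11),
    content_until=date(2009, 6, 11),
    resolution_until=date(2012, 6, 11),
)
```

With these invented dates, the identifier is still reserved in 2010, but it is no longer actionable because the content commitment has lapsed. This is why the three kinds of persistence have to be stated and managed separately.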
Other identifier issues

A collection of specific issues was raised during the discussion:
Lessons from NSDL

In the NSDL, identifiers are used to identify metadata, not resources. Identification and resolution should be treated as separate issues.

How is the appropriate copy issue dealt with in the handle world? In the Appropriate Copy Project, an OpenURL is used to transport the DOI. Mechanisms for identifying local and other copies are provided as additional services built over the handle infrastructure. One of the reasons that people like OpenURL is that it takes data which can then be resolved depending on the resolving service rather than prestructuring a persistent identifier.

Duplication of identifiers is a problem.

In a system as heterogeneous as the NSDL, any attempt to put semantics in identifiers is likely to fail. But in a small scale implementation it might work. CORDRA may fall on the boundary between small and large scale implementation, and thus it might be possible to embed semantics in a CORDRA identifier.

Other Issues

Low-cost implementation solutions are more likely to be adopted. The cost of creating an identifier includes creating associated metadata, etc.

Processes and workflow for creating metadata and assigning identifiers are required.

We will need to determine who owns the identifier.

CORDRA provides a repository view of content, but users will want to take an object and create a copy. What happens with metadata and identifiers when a copy of the object is made?

[Section 2.2 is based on notes by Dan Rehak and Kerry Blinco.]
wya
Last changed: June 11, 2004