Fedora Proposal:  Content Model Dissemination Architecture (CMDA)

Sandy Payette, Eddie Shin, and Chris Wilper (Cornell University)
Ross Wayland (University of Virginia)

March 2006


Table of Contents

I.    Problem Definition

II.   CMDA - A New Way to Do Disseminators with Content Models

III.  Open Questions and Design Issues


I. Problem Definition

1.  Need formalized, built-in support for the notion of a "content model"

Individual digital objects can conform to an informal "content model" (a descriptive property of a Fedora digital object).  This informal descriptor is used to define the nature of a set of digital objects (as in the number and types of datastreams and disseminators).  For example, at the University of Virginia, a set of content models has been defined for images, TEI texts, EAD finding aids, and more (see: http://www.lib.virginia.edu/digital/resndev/fedora_imp/content_models.htm).  As of Fedora 2.1, institutions develop their own sets of rules for content models, and enforce these rules either through "best practices" or through custom validation code within workflow applications or middleware on top of Fedora.  Thus, as of Fedora 2.1 there is no built-in support for content models in the repository, except that the content model descriptor can be stored in an object, and thus searched upon.  This gives institutions a group identifier so they can locate objects that have the same datastream and disseminator patterns.

2.  Need a better way to manage disseminators on objects

As of Fedora 2.1, digital objects can have one or more disseminators attached to them.  Each disseminator is pre-bound to an individual digital object (and stored as a component of the object in FOXML).  It has been observed that modification of pre-bound disseminators is awkward:  if a disseminator must change, each object that carries it must be individually modified.  Also, the current release of Fedora does not allow modification of the BMech and BDef objects that lie behind disseminators.  This means that (1) you can't add a new method to a BDef/BMech, (2) you can't modify existing method definitions, and (3) you can't modify BMech binding maps.  The current reasons for this restriction are: (1) details related to the current implementation of the "replication" module that keeps the dissemination database tables up to date, and (2) over-optimization of the dissemination database tables.

While the current situation does not interfere with how disseminations work at runtime, and it does not prevent people from modifying their disseminators, the current design does not make it easy to do certain kinds of modifications.  If you want to modify an existing BDef or BMech object, you must purge it, make the changes, then re-ingest it.  This would be easy enough, except that the system currently will not allow you to purge BDef/BMech objects that are referenced in digital object disseminators (the system enforces referential integrity between an object's disseminators and the BDef/BMech objects in the repository).  Thus, you are forced to take the following steps: (1) purge disseminators on every digital object that uses the BDef/BMech object that you want to change, (2) purge the BDef/BMech, (3) re-ingest the modified BDef/BMech, (4) add new disseminators back on the digital objects.

Clearly this situation is not what was intended in the original Fedora design!  Too much enforcement of referential integrity, but not enough flexibility in terms of disseminator management.  

There are three basic goals that we put forth for Fedora 3.0 (development during 2006-2007):

1.  Formalize Content Models in the core Fedora repository service

2.  Allow easier management of Disseminators (and related BDef and BMech objects)

3.  Simplify and streamline the Fedora system by eliminating existing dissemination database and re-factoring the replication module

 

II.  CMDA - A New Way to do "Disseminators" with Content Models

The proposed Content Model Dissemination Architecture (CMDA) is intended to provide a looser binding of disseminators to digital objects by building disseminators around the notion of "content models."   A pre-requisite for this strategy is the formalization of content models, and the registration of such content models as special Fedora digital objects known as Content Model (CModel) objects (similar to how Fedora now registers BDef and BMech objects).  A CModel object will hold the specifications about the number and types of datastreams that must exist in any "conforming" digital object.  

Up through Fedora 2.1, digital objects could have "disseminators" directly linked to them.  In the newly proposed CMDA, digital objects will not carry their own disseminators.  Instead, each object will be associated with a CModel object from which it will acquire compatible services (i.e., "disseminations").  The CMDA will also enable both "contractual" and "opportunistic" disseminations.  Contractual disseminations are behaviors that a digital object attains because it conforms to a particular CModel.  The CModel object has relationships to particular services (via relationships to BMech objects).  This is very similar to the current notion of a Disseminator in Fedora;  it achieves the same result in a more indirect manner.  In contrast, "opportunistic" disseminations are behaviors that an object can attain in a totally dynamic manner at runtime (via simple service matching algorithms).  This new way of defining behaviors for digital objects is described, step-by-step, below.

1.  Basic Relationships:  Object - CModel - BMech - BDef

In the proposed CMDA, regular digital objects in a repository can have a relationship to one or more CModel objects (assume conformance is validated).  CModel objects, in turn, have relationships to one or more BMech objects, which store the service description and service binding metadata that are the building blocks for a set of run-time behaviors for "conforming" digital objects.  As always, a BMech object is related to a BDef object (a BDef defines a set of methods in the abstract; a BMech contains WSDL bindings to a concrete service that runs the abstract behaviors).

If a regular digital object has a relationship with a CModel object, then that digital object has a transitive relationship to the BMech objects related to the CModel.  The new CMDA will exploit these relationships to enable digital objects to attain behaviors at runtime.  Conceptually, it is as though regular digital objects inherit disseminators from their content models.
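
The transitive chain described above can be sketched in a few lines.  This is only an illustration, not the proposed implementation:  the in-memory tables, the function names, and the PIDs are all hypothetical stand-ins for whatever store the repository ends up using.

```python
# Hypothetical relationship tables; PIDs are illustrative only.
OBJECT_TO_CMODELS = {"demo:99": ["cmodel:article-1", "cmodel:generic-document"]}
CMODEL_TO_BMECHS = {
    "cmodel:article-1": ["bmech:2", "bmech:3"],
    "cmodel:generic-document": ["bmech:3"],
}
BMECH_TO_BDEF = {"bmech:2": "bdef:1", "bmech:3": "bdef:2"}


def bmechs_for_object(pid):
    """Follow Object -> CModel -> BMech and return BMech PIDs without duplicates."""
    seen = []
    for cmodel in OBJECT_TO_CMODELS.get(pid, []):
        for bmech in CMODEL_TO_BMECHS.get(cmodel, []):
            if bmech not in seen:
                seen.append(bmech)
    return seen


def bdefs_for_object(pid):
    """Extend the chain one more hop: BMech -> BDef."""
    return sorted({BMECH_TO_BDEF[b] for b in bmechs_for_object(pid)})
```

Note that the object itself stores only its CModel relationships;  changing a CModel-to-BMech entry changes the behaviors of every conforming object without touching those objects.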

 

2.  Content Model Objects (CModel)

Definition:  

Content Model (CModel) =  Datastream Composite Model  (required)  +  Relationship(s) to Behavior Model(s) (optional)

Content models (CModels) are stored as a special type of digital object in a Fedora repository.  These special objects are "control" objects in the way Fedora BDef and BMech objects are.  A CModel object is used to establish a set of constraints that other digital objects must satisfy if they are said to be "conforming" to a content model.  Below is a simple design for CModel objects, dealing with both the modeling of datastreams, and the association of services (behaviors) with the model:

      The CModel object contains:

       

It should be noted that if a dsTypeModel element specifies that a particular type of datastream can appear more than once in a conforming digital object, then the datastream id for each instance must be unique within the conforming object.  In cases such as this, the conforming digital object will append (_N) to the prescribed datastream id and increment N for each instance, as below:
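
A minimal sketch of this naming rule follows.  It assumes N starts at 1 (the proposal does not say where counting begins), and the helper name instance_ids is hypothetical.

```python
def instance_ids(prescribed_id, count, start=1):
    """Return unique datastream IDs for `count` instances of one dsTypeModel.

    `start` is an assumption; the proposal only says N increments per instance.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return [f"{prescribed_id}_{n}" for n in range(start, start + count)]
```

For example, three instances of a datastream whose prescribed id is PAGE would be named PAGE_1, PAGE_2, and PAGE_3 under these assumptions.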

 

 

3.  Digital Objects that Conform to a CModel

In the CMDA, regular digital objects can assert that they meet the constraints of one or more content models.  It should be noted that in Fedora 2.1, a digital object had a core property for content model.  This property was treated as a simple string and was not controlled in any way by the system.  It has been used to identify an informal notion of a content model, meaning that it was an informal group identity for an object.  The Fedora system had no way to enforce conformance or to do anything but index the property as a general descriptor for the object.  This property was not repeatable, thus in Fedora 2.1 an object could have only one informal content model.

In the new CMDA, an object can conform to multiple content models (polymorphism of content model conformance).  The content model will become a repeatable object-to-object relationship property of the object.  The subject is the digital object URI, the predicate is the "hasFormalContentModel" relationship, and the object is the URI of a CModel object.  This can be expressed in RELS-EXT as follows:

    <rdf:RDF
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
     xmlns:fedora="info:fedora/fedora-system:def/relations-external#"
     xmlns:myns="http://www.nsdl.org/ontologies/relationships#"> 

     <rdf:Description rdf:about="info:fedora/demo:99">
        <fedora:hasFormalContentModel rdf:resource="info:fedora/cmodel:article-1"/>
        <fedora:hasFormalContentModel rdf:resource="info:fedora/cmodel:generic-document"/>
     </rdf:Description>

    </rdf:RDF>
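
As an illustration of how a client or the repository might read these assertions back, the sketch below extracts the subject and its CModel URIs from a RELS-EXT document using Python's standard xml.etree.ElementTree.  The function name is hypothetical, and the embedded XML is a trimmed copy of the example above.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FEDORA_NS = "info:fedora/fedora-system:def/relations-external#"

RELS_EXT = """<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:fedora="info:fedora/fedora-system:def/relations-external#">
 <rdf:Description rdf:about="info:fedora/demo:99">
  <fedora:hasFormalContentModel rdf:resource="info:fedora/cmodel:article-1"/>
  <fedora:hasFormalContentModel rdf:resource="info:fedora/cmodel:generic-document"/>
 </rdf:Description>
</rdf:RDF>"""


def formal_content_models(rels_ext_xml):
    """Return (subject URI, [CModel URIs]) from a RELS-EXT document."""
    root = ET.fromstring(rels_ext_xml)
    desc = root.find(f"{{{RDF_NS}}}Description")
    subject = desc.get(f"{{{RDF_NS}}}about")
    models = [
        el.get(f"{{{RDF_NS}}}resource")
        for el in desc.findall(f"{{{FEDORA_NS}}}hasFormalContentModel")
    ]
    return subject, models
```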

 

4.  Option for "Non-Contractual" Behaviors on Digital Objects via Content Models

Thus far, we have discussed how CModel objects can assert contractual relationships to BMechs.  The net result is that, at runtime, any digital objects that are related to that content model ("conforming objects") will attain the behaviors of those BMech(s) that the CModel has asserted a contractual relationship to.   The CModel-to-BMech relationship "hasContractualBMech" controls what behaviors will get associated with digital objects that conform to the CModel.  

It should be noted that in figure X, above, BMech objects can also assert relationships to CModels (orange arrows labeled "isCompatibleWith" pointing from BMech to CModel).  We also noted that a CModel object may not have asserted a contractual relationship with all compatible BMechs.  BMechs that are compatible with a CModel, but that are not named by the CModel, have the potential to endow the CModel with behaviors;  it is just that the CModel has not explicitly modeled such a relationship.  (The BMech-to-CModel relationships are "incoming" arrows asserted outside the context of the CModel.)  These incoming relationships mean that a BMech declares that it is compatible with a CModel.  The BMech has the potential to provide run-time behaviors for objects that conform to the CModel, but the CModel has not explicitly "authorized" this relationship.

It is possible to allow a CModel to declare that it will endorse these "non-contractual" relationships.   The CModel can do so by asserting a special property ("endorseNonContractualBMechs") and setting it to true or false.  This is done in RELS-EXT as follows:

    <rdf:RDF
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
     xmlns:fedora="info:fedora/fedora-system:def/relations-external#"
     xmlns:myns="http://www.nsdl.org/ontologies/relationships#"> 

     <rdf:Description rdf:about="info:fedora/demo:99">
        <fedora:hasContractualBMech rdf:resource="info:fedora/bmech:2"/>
        <fedora:hasContractualBMech rdf:resource="info:fedora/bmech:3"/>
        <!-- Allow other compatible BMechs to bind with objects conforming to this CModel -->
        <fedora:endorseNonContractualBMechs>true</fedora:endorseNonContractualBMechs>
     </rdf:Description>

    </rdf:RDF>

The CMDA will look for this property at runtime.  If the CModel says that it endorses non-contractual BMechs ("true"), the behaviors of such BMechs will be made available on conforming digital objects at runtime.   If the CModel says it will not endorse non-contractual BMechs ("false"), then the behaviors of such BMechs will not be bound to conforming digital objects.  
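
The runtime rule can be stated compactly.  The sketch below is an assumption about how the check might look, not a description of existing Fedora code;  the function and parameter names are hypothetical.

```python
def effective_bmechs(contractual, compatible, endorse_non_contractual):
    """Return the BMech PIDs whose behaviors conforming objects get at runtime.

    contractual: BMechs named by the CModel via hasContractualBMech.
    compatible:  BMechs that assert isCompatibleWith toward the CModel.
    """
    result = list(contractual)
    if endorse_non_contractual:
        # Add compatible BMechs the CModel did not name, keeping order stable.
        result += [b for b in compatible if b not in result]
    return result
```

With endorse_non_contractual set to false, only the contractual BMechs are returned, no matter how many BMechs declare compatibility.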

A question arises as to whether the endorsement of non-contractual BMechs can be controlled at either the repository configuration level, or at the level of any single digital object.  We can have a repository configuration option that globally enables/disables the capability of binding such BMechs at run time.  Also, from the perspective of an individual digital object, it is always possible to have an XACML policy to permit only disseminations from certain BMechs.  Thus, in terms of controlling whether "non-contractual" BMechs can bind at runtime there are three options:

 

5.  Option for "Opportunistic" Behaviors on Digital Objects

It may be desirable to allow completely dynamic and opportunistic behaviors to be bound with digital objects at run time, outside of the context of CModels.  This functionality could be enabled by having a repository configuration option to allow MIME (or FormatURI) matching directly between digital objects and BMechs.  An example might be where a BMech was defined to do a generic service operation like convert PDF to HTML.  It would deal with any PDF, irrespective of the role of that PDF in a digital object.  This would be the dynamic association of a general utility behavior with a digital object.  This may be out of scope for CMDA, but it may be another sort of dynamic dissemination feature we may want to explore.  <Elaborate on this more, and discuss whether this is a desirable feature to support in Fedora.>
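
One possible shape for such MIME matching is sketched below.  It is a guess at the simplest workable rule (exact MIME equality);  the data layout, function name, and PIDs are hypothetical, and FormatURI matching is not shown.

```python
def opportunistic_bmechs(datastreams, bmech_inputs):
    """Match BMechs to datastreams purely by MIME type.

    datastreams:  {datastream ID: MIME type} for one digital object.
    bmech_inputs: {BMech PID: set of MIME types the service accepts}.
    Returns {BMech PID: [matching datastream IDs]}, omitting BMechs with no match.
    """
    matches = {}
    for bmech, accepted in bmech_inputs.items():
        ids = sorted(ds for ds, mime in datastreams.items() if mime in accepted)
        if ids:
            matches[bmech] = ids
    return matches
```

In the PDF-to-HTML example, a BMech that accepts application/pdf would match every PDF datastream in the object, regardless of the role that datastream plays.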

5. Database Schema

One of the most problematic aspects of the current dissemination architecture in Fedora is the dissemination database, particularly keeping it up-to-date as digital objects are added/modified/purged.  Our attempts to make these tables efficient have resulted in an over-normalization that has proven inefficient for purge operations (e.g., there are cases where entries in certain tables are shared by multiple digital objects, making for expensive manual queries to maintain referential integrity among the tables).  Most concerning, however, is the replication module, which is the code that keeps these tables up to date by replicating changes after every API-M transaction.  The combination of table design, referential integrity enforcement, and the replication module has made it difficult for us to open up functionality to let people modify existing BDef and BMech objects.  The main issues pertain to code in place to prevent breaking disseminators on existing objects that use these BDef/BMech objects.  (More details later.)  In the meantime, we are undertaking an analysis of the database and the replication module to see if we can make immediate improvements to (1) improve performance, and (2) enable modification of BDef/BMech objects.

However, if we pursue the new content model dissemination architecture proposed in this document, we may be able to work around the existing problems (and possibly obsolete the existing dissemination database).  Below are links to diagrams of the existing database schema, and two possible new ones:

 

6.  CMDA Dissemination Algorithm (pulling it all together at run time)

(I'm writing up this section from my paper notes.  This will describe the processing steps and database queries for ListMethods and GetDissemination)

The basic idea is that a target digital object asserts that it conforms to a CModel.  A CModel object has a relationship to a BMech object.  The target digital object has datastreams with MIME types (and possibly format URIs).  The CModel has a dsCompositeModel that describes a set of dsTypeModel elements for "abstract" datastreams, including the prescribed datastream IDs.  Each dsTypeModel also has one or more serviceBindingKey elements defined within it.  These are ultimately used to link the dsTypeModel elements (abstract datastreams) to the semantic keys in a BMech's datastream binding map.  Minimal information is kept in the relational database.  To enable the disseminator matching, the database records relationships of CModel objects to BMech objects.  Everything else is done dynamically.
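
Pending the full write-up, the matching idea in the paragraph above might look roughly like the following.  This is a speculative sketch:  the dictionary layout for dsTypeModel elements, the exact-ID and exact-MIME matching rule, and the function name are all assumptions.

```python
def resolve_bindings(object_datastreams, ds_type_models, bmech_binding_keys):
    """Map each BMech binding key to a concrete datastream ID in the object.

    object_datastreams: {datastream ID: MIME type} for the target object.
    ds_type_models:     list of {"id", "mime", "serviceBindingKeys"} dicts
                        taken from the CModel's dsCompositeModel.
    bmech_binding_keys: semantic keys from the BMech's datastream binding map.
    """
    bindings = {}
    for key in bmech_binding_keys:
        for model in ds_type_models:
            ds_id = model["id"]
            if (
                key in model["serviceBindingKeys"]
                and object_datastreams.get(ds_id) == model["mime"]
            ):
                bindings[key] = ds_id
                break
        else:
            # No abstract datastream with this key is present in the object.
            raise LookupError(f"no datastream satisfies binding key {key!r}")
    return bindings
```

Nothing here is stored per object;  the bindings are recomputed from the object's datastream list, the CModel, and the BMech on each request.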

1.  ListMethods (API-A)

2.  GetDissemination (API-A)

3.  GetObjectHistory (API-A)

 

7.  Disseminator Migration Utility

If we decide that we either want to obsolete existing disseminators, or we want to give people the option to migrate existing objects to the new CMDA, then we need an easy migration utility that will not require an entire export/ingest of objects.  It might be possible to create a utility that works like the repository rebuilder and crawls the FOXML sources.  Here is one way it might work:

1.  STEP 0:  seed the utility with the PIDs of a set of existing objects that are representative of all existing digital objects that have disseminators on them

2.  STEP 1:  auto-create CModel objects by reading representative digital objects.  For each unique disseminator, build a CModel object.  The dsCompositeModel will be driven off the disseminator, back-referencing to datastreams in the representative digital object.  (The related datastreams become dsTypeModel elements in the dsCompositeModel;  we can pick up MIME and formatURI from the representative object, and serviceBindingKey values from the datastream binding map on the disseminator.)  Assert the "hasContractualBMech" relationship in RELS-EXT of the CModel object by grabbing the BMech PID off of the disseminator.  Record the CModel-BMech relationships in the database (varies dependent on which implementation approach we go with).

3.  STEP 2:  crawl all FOXML files.  For each Data Object, look at disseminators and check the BMech PID.  If we have the correlation somewhere of which CModel objects are associated with which BMech objects, then we can: (1) remove the disseminator from the Data Object, (2) assert the relationship to the CModel object (presumably in RELS-EXT), and (3) record the DataObject-CModel relationships in the database (varies dependent on which implementation approach we go with).
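
The steps above can be sketched over plain dictionaries standing in for parsed FOXML.  This is a rough illustration only:  it assumes disseminators are "unique" when they name distinct BMech PIDs, it invents CModel PIDs of the form cmodel:N, and it omits formatURI and the database recording steps.

```python
def build_cmodels(representatives):
    """STEP 1: one CModel per distinct BMech PID found on representative objects."""
    cmodels = {}
    for obj in representatives:
        for diss in obj["disseminators"]:
            bmech = diss["bmech"]
            if bmech in cmodels:
                continue
            cmodels[bmech] = {
                "pid": f"cmodel:{len(cmodels) + 1}",  # hypothetical PID scheme
                "hasContractualBMech": bmech,
                "dsTypeModels": [
                    {
                        "id": ds_id,
                        "mime": obj["datastreams"][ds_id],
                        "serviceBindingKeys": [key],
                    }
                    for key, ds_id in sorted(diss["bindings"].items())
                ],
            }
    return cmodels


def migrate(obj, cmodels):
    """STEP 2: swap each disseminator for a relationship to the matching CModel."""
    kept = []
    for diss in obj["disseminators"]:
        cmodel = cmodels.get(diss["bmech"])
        if cmodel is None:
            kept.append(diss)  # no known CModel; leave the disseminator alone
        else:
            obj["rels_ext"].append(("hasFormalContentModel", cmodel["pid"]))
    obj["disseminators"] = kept
    return obj
```

A real utility would also need to verify that each crawled object actually conforms to the auto-created CModel before removing its disseminator.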

 

8.  Resource Index Implications

In Fedora 2.1, the RDF-based Resource Index contains a triple for every "stable" dissemination on digital objects.   A stable dissemination is defined by a behavior method that either (1) does not have any parameters, or (2) has parameters, but the parameter values are from a fixed domain.    The Resource Index is kept up to date, incrementally, as API-M add/modify/purge operations are committed.   In Fedora 2.1, each object has its own dissemination, so each specific dissemination for each object can be easily figured out directly from the disseminator on the object.  In the CMDA, we propose that disseminators are not put directly on each object, but instead, disseminations are figured out via the relationship the object has with one or more CModel objects.  
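
The definition of a "stable" dissemination can be made concrete with a small sketch.  The representation is an assumption:  each method is described by a dictionary mapping parameter names to a fixed list of allowed values, or to None when the domain is open.

```python
from itertools import product


def is_stable(param_domains):
    """True if the method has no parameters or every parameter has a fixed domain."""
    return all(domain is not None for domain in param_domains.values())


def stable_disseminations(method, param_domains):
    """Enumerate every (method, parameter values) pair for a stable method."""
    if not is_stable(param_domains):
        return []
    names = sorted(param_domains)
    combos = product(*(param_domains[n] for n in names))
    return [(method, dict(zip(names, combo))) for combo in combos]
```

The number of enumerated pairs is the number of triples that pre-calculation would have to maintain for that method on each object.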

To support the CMDA, we propose some modifications to the Fedora model in the Resource Index.  These modifications will be as follows:

  1. insert CModel nodes in the graph
  2. assert relationships from digital objects to CModels
  3. assert relationships from CModels to BMech objects (and vice versa)
  4. do not "pre-inference" every dissemination on every object;  instead this can be figured out by querying the Resource Index differently

<insert diagram of new RDF model for Fedora objects using CMDA>

This modification will be an accurate reflection of the new CMDA in the Resource Index.  It will cut down on the number of triples in the Resource Index.  Most importantly, it will simplify the incremental updating algorithm for the Resource Index, since dissemination triples will no longer be pre-calculated for all digital objects.  With pre-calculated triples, costly updates would occur whenever a CModel or BMech object is modified (since this would involve expensive queries to see which digital objects were affected, and then many triple deletes/inserts to make sure that all dissemination triples on individual digital objects reflect changes to the CModel or BMech nodes).  It should be noted that this update scenario would eventually hit us in Fedora 2.1 as well;  we have not had to deal with it only because Fedora 2.1 does not allow modification of BMech objects.  In any event, the CMDA will provide the opportunity to simplify the incremental updating of the Resource Index.

In terms of evaluating the impact of dissemination triples not being "pre-calculated" in the Resource Index, we evaluated the changes to RI queries that would be necessary to get information about disseminations on objects.  Below are sample queries for discovering disseminations on digital objects using the new model.

Query: Determine which methods are supported by an object (demo:11)
-------------------------------------------------------------------
select $dissType
from <#ri>
where <info:fedora/demo:11> <hasCModel> $cModel
and $cModel <usesBMech> $bMech
and $bMech <implementsBDef> $bDef
and $bDef <definesMethod> $dissType

Query: Which objects' bdef:1/getDC dissems changed between time 1 and 3?
-------------------------------------------------------------------------
select $object
from <#ri>
where $bMech <implementsBDef> <info:fedora/bdef:1>
and $cModel <usesBMech> $bMech
and $cModel <datastreamType> $datastreamType
and $datastreamType <ID> $datastreamTypeID
and $object <hasDatastream> $datastream
and $datastream <ID> $datastreamTypeID
and $datastream <lastModifiedDate> $dsModDate
and $dsModDate <after> '1'
and $dsModDate <before> '3'

Query: More accurate version of above query
--------------------------------------------
select $object
from <#ri>
where $bMech <implementsBDef> <info:fedora/bdef:1>
and $bMech <hasMethodImpl> $methodImpl
and $methodImpl <semanticType> $semanticType
and $cModel <usesBMech> $bMech
and $cModel <datastreamType> $datastreamType
and $datastreamType <semanticType> $semanticType
and $datastreamType <ID> $datastreamTypeID
and $object <hasDatastream> $datastream
and $datastream <ID> $datastreamTypeID
and $datastream <lastModifiedDate> $dsModDate
and $dsModDate <after> '1'
and $dsModDate <before> '3'

9.  New End User Tools to Support CMDA:

III. Open Questions and Design Issues

1.  Open Issues for Database Schema:  

What is the minimal set of database tables necessary to support CMDA (with good performance)?  Most notably, the proposed CMDA database (see db schema options PROPOSED A and PROPOSED B) does not record binding information for individual digital objects (i.e., the relationships between datastreams of specific digital objects and BMech binding keys).  We expect the run time dissemination algorithm to be fast enough that it won't be a problem.  CMDA requires that whenever dissemination-oriented requests are made upon a digital object (e.g., listMethods, getDissemination), that object's FOXML must be parsed to get a list of datastreams in the object.  In general, the new algorithm for fulfilling disseminations may be I/O intensive if we go with minimal db tables and lean towards parsing FOXML for the target object, CModel object, and BMech object.  However, earlier JMeter tests and actual experience have shown that the SAX parsing approach performs well.  We need to test the new CMDA under load again to be sure.  If necessary, we can add database tables, but we must be careful not to recreate something that looks like what we have now with the dissemination database.
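
To give a feel for the per-request parsing cost being discussed, here is a small sketch that lists an object's datastreams with a SAX handler.  It uses Python's standard xml.sax rather than the Java SAX parser Fedora itself uses, the sample FOXML is heavily abbreviated, and the class and function names are hypothetical.

```python
import io
import xml.sax
from xml.sax.handler import feature_namespaces

FOXML_NS = "info:fedora/fedora-system:def/foxml#"

SAMPLE_FOXML = b"""<foxml:digitalObject
  xmlns:foxml="info:fedora/fedora-system:def/foxml#" PID="demo:99">
  <foxml:datastream ID="DC" STATE="A" CONTROL_GROUP="X">
    <foxml:datastreamVersion ID="DC.0" MIMETYPE="text/xml" LABEL="Dublin Core"/>
  </foxml:datastream>
  <foxml:datastream ID="IMG" STATE="A" CONTROL_GROUP="M">
    <foxml:datastreamVersion ID="IMG.0" MIMETYPE="image/jpeg" LABEL="Image"/>
  </foxml:datastream>
</foxml:digitalObject>"""


class DatastreamLister(xml.sax.ContentHandler):
    """Collect {datastream ID: MIME type of the last version seen}."""

    def __init__(self):
        super().__init__()
        self.datastreams = {}
        self._current = None

    def startElementNS(self, name, qname, attrs):
        uri, local = name
        if uri != FOXML_NS:
            return
        if local == "datastream":
            self._current = attrs.get((None, "ID"))
        elif local == "datastreamVersion" and self._current is not None:
            self.datastreams[self._current] = attrs.get((None, "MIMETYPE"))

    def endElementNS(self, name, qname):
        if name == (FOXML_NS, "datastream"):
            self._current = None


def list_datastreams(foxml_bytes):
    """Stream through the FOXML once and return its datastream IDs and MIME types."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)  # deliver (uri, localname) names
    handler = DatastreamLister()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(foxml_bytes))
    return handler.datastreams
```

Because the handler keeps only a small dictionary, memory use stays flat regardless of how large inline datastream content is.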

2.  Open Issues for Traditional Disseminators:  Should we obsolete traditional disseminators?

3.  Open Issues for Validation (Referential Integrity of Object-to-CModel-to-BMech-to-BDef):

The rigor of the CMDA's referential integrity validation is something we want to discuss more.  There are pros and cons to both a loose approach and a very tight approach.

4.  Open Issues for Content Model Objects:

5.  Open Issues for Historical Disseminations (via versioning)

6.  Open Issues for Export with Behaviors: 

7.  New Design Possibility:  Content Model Inheritance

<add here:  details on how to achieve CModel inheritance based on discussions in meeting>

8.  Evaluation of Typical Process Flows in creating Objects, CModels, and BDef/BMechs