The projects are designed to give students the chance to build information
environments using tools, techniques, and mechanisms covered in the lectures.
In general, the assignments will require students to do some design work,
understand relevant protocol or specifications documents, and write a moderate
amount of Java and XSLT code.
Students will work in groups of two on these assignments. There will be an
opportunity for group formation at the beginning of the semester and groups will
remain together for the remainder of the semester. Members of the group are
expected to share information and design ideas, jointly understand protocol
documents and APIs, and write the final code product. Groups must work
independently of each other and any evidence of copying of ideas, designs, code,
etc. will be considered an academic integrity violation. Grades will be awarded
based on the final product of the group and each student's contribution to the
work of the group.
Prerequisites
The assignments assume that students can program in Java and understand how
to download and use class libraries. No Java or programming tutorials will be
offered. Assignments will also depend on XSLT coding, which will be introduced
in lecture. However, there will not be detailed XSLT programming tutorials.
Students are expected to use on-line materials or available books for this.
Grading Criteria
This is not a programming course. Imaginative algorithms or data structures
will not be required or play a role in grading. Instead, grading will be based
on completion of the assigned task, demonstrated understanding of the concepts
and protocols underlying the assignment, and project design. Nevertheless,
assignments should demonstrate good programming practices and documentation
commensurate with the 400 level of this course.
Programming Environment
Programming assignments should be done using the Eclipse IDE. This is
available for free from http://www.eclipse.org
for all major operating systems. Submissions will be in the form of Eclipse
projects.
Tools
Working with XML, XSLT, and the like is considerably easier if you don't have
to worry about syntactic details. I highly recommend that you use
oXygen, a very nice environment for working in the XML/XSLT world. It is
available for a 30-day free trial and has a very attractive academic license
fee. It also integrates as a plug-in for Eclipse. oXygen runs on Windows,
Mac OS X, and Linux!
Submitting Assignments
All assignments are due as listed on the course calendar. NO LATE
ASSIGNMENTS WILL BE ACCEPTED.
Projects will be submitted as zip files to CMS. They MUST
conform to the following guidelines:
- Each group should identify a group leader when they form. That person's
name will serve as the "firstlast" in the remainder of these instructions.
- The Eclipse project must be named as firstlastassignment# (e.g.,
CarlLagozeAssignment1)
- The first executable line of the program should be
System.out.println("TeamMember1, TeamMember2");
- The project directory should be submitted as one ZIP file.
Submissions that fail to conform to these guidelines will be rejected.
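As an illustration of the third guideline, a minimal sketch of an entry point might look like the following. The class name Assignment1 and the member names are placeholders for illustration only; use your own names.

```java
// Hypothetical entry point; the class name is an assumption, not a requirement.
public class Assignment1 {
    // Placeholder names: replace with the real names of your group members.
    static final String TEAM = "TeamMember1, TeamMember2";

    public static void main(String[] args) {
        // This must be the first executable line of the program.
        System.out.println(TEAM);
        // ... the rest of the assignment logic follows here ...
    }
}
```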
Due: April 3, 11:59 PM
Overview
Amazon.com and the Library of Congress (LC) have just signed an agreement to work on
several joint projects. Based on your A+ grade in CS431, which
demonstrates your proficiency with metadata and information modeling, you have been hired by Amazon.com
to work on these projects. Your managers have asked you to
design and prototype a system that uses FRBR-modeled metadata to relate resources in LC's digital collections
to books available for sale from Amazon. To do this you will 1) design a
schema that models your new metadata format, 2) harvest metadata from LC that
describes books that have been digitized by the library, 3) find matching
physical books at Amazon using their web service API, and 4) programmatically
transform the metadata that describes these instances of the same
intellectual entity into your FRBR-based metadata model. The result will be a
set of metadata records in XML that represent the related resources. A career of fame and fortune in the
e-commerce industry awaits you if you can demonstrate your hard-earned skills to
management.
Detailed Instructions
- Register for use of the Amazon E-Commerce Service (ECS). Registration is available via a
link from
http://www.amazon.com/gp/browse.html/002-3738912-9258424?node=3435361.
Registration is free.
- Review the specifications for ECS [5], in particular how to search for
products using REST requests to the ItemSearch operation. The ECS
pages provide extensive documentation of the format of requests and
responses, and include the schema for XML-formatted result sets that are
returned. Experiment with book searches using ItemSearch REST
requests, and examine the
returned XML to understand the data supplied by Amazon for its items.
(You should use the "medium" ResponseGroup value for your ItemSearch
requests). You
will find that oXygen [6] is a big help for examining the structure of the
XML. Pay particular attention to:
- The header information in the response document that indicates success,
failure, and cardinality of the result set.
- The information returned for each item, in particular the item
attributes.
- Review the OAI-PMH specification [12], in particular the description of
the ListRecords request. Formulate the request URL to harvest metadata
from the Library of Congress at baseURL
http://memory.loc.gov/cgi-bin/oai2_0, narrowing the request to the set
lcbooks in metadata format oai_dc with records available since
January 1, 2005. Examine the response to understand the format of the
metadata records returned by the harvest.
- Design an XML schema that has the following characteristics:
- It should specify a container for a list of metadata records.
- Each metadata record in the container should correspond to a work in the
FRBR sense. The schema should model each work as a container that
reflects the relationships between works, expressions, manifestations, and
items in the FRBR model.
- It should associate with each of the FRBR entities (work, expression,
manifestation, item) sub-elements derived from the Dublin Core and ECS
vocabularies that correspond to attributes of that intellectual entity.
In cases where there is a duplication of meaning between ECS and Dublin
Core, use the Dublin Core element. For example, both DC and ECS
include "title" - only include the DC title in this case. Your schema
should make use of the following metadata elements:
- Dublin Core: title, creator, subject, description, publisher, date,
type, identifier, language, coverage, rights.
- ECS: Binding, ISBN, NumberOfPages, Publisher, ASIN
- Your schema should make use of namespaces. It should include at
least three separate namespaces: 1) for elements in the DC namespace, 2) for
elements based on the FRBR model (use a namespace URI of
http://www.ifla.org/frbr#), and 3) for elements from ECS. If
you find it necessary to create your own elements, you should create a
fourth namespace (you can formulate your own namespace URI).
- Use comments within your schema so we understand the reasons for
your design decisions.
- Write an XSLT document that transforms the results of your harvest
request to Library of Congress to an XML document that validates according
to your schema. The transform should effectively place the DC elements
returned by the harvest request in their proper location in the FRBR
description.
- Write a Java program that does the following:
- Issues the OAI-PMH request to the Library of Congress (see HTTPClient
[13]).
- Processes the response using your XSLT document (see dom4j [11] or Saxon
[9]). The result will be an XML document that validates according to
your schema, but contains only data from the OAI-PMH request to LC. Write this
intermediate XML document to a file called stage1.xml. This will be
part of the package you hand in.
- Iterates through the metadata records in the intermediate xml document.
For each metadata record:
- Search Amazon using ECS for a book that matches the work referenced by
the record. Your search criteria can be title and author and the
criterion for a successful match can be when a single item is returned
in the result set from ItemSearch.
- For each successful match, insert the ECS metadata that describes the Amazon
book as a FRBR entity in the XML document produced by your XSLT transform.
You should use DOM or XSLT as the mechanism for these tree insertions (see dom4j
[11] or Saxon [9]). You may NOT manipulate the XML file by doing
string insertions (i.e., treating the XML data as simple text).
- Write the final XML document to a file called stage2.xml.
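The two request URLs described in the steps above could be assembled along the following lines. This is only a sketch under stated assumptions: the OAI-PMH arguments (verb, metadataPrefix, set, from) come from the protocol specification [12] and the base URL and set name from this assignment, but the ECS endpoint, the SubscriptionId parameter, the Books search index, and the capitalization of the ResponseGroup value are my recollection of the ECS 4.0 REST interface and should be checked against the ECS documentation [5]. The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class RequestUrls {
    // Base URL given in the assignment text.
    static final String LC_BASE_URL = "http://memory.loc.gov/cgi-bin/oai2_0";

    // ASSUMPTION: ECS 4.0 REST endpoint; confirm against the ECS documentation [5].
    static final String ECS_BASE_URL = "http://webservices.amazon.com/onca/xml";

    /** Builds an OAI-PMH ListRecords request for one set, format, and start date. */
    static String listRecordsUrl(String set, String metadataPrefix, String from) {
        return LC_BASE_URL + "?verb=ListRecords"
                + "&metadataPrefix=" + enc(metadataPrefix)
                + "&set=" + enc(set)
                + "&from=" + enc(from);
    }

    /** Builds an ItemSearch request by title and author (parameter names are assumptions). */
    static String itemSearchUrl(String subscriptionId, String title, String author) {
        return ECS_BASE_URL + "?Service=AWSECommerceService"
                + "&SubscriptionId=" + enc(subscriptionId)
                + "&Operation=ItemSearch"
                + "&SearchIndex=Books"
                + "&Title=" + enc(title)
                + "&Author=" + enc(author)
                + "&ResponseGroup=Medium";
    }

    /** URL-encodes a single query parameter value. */
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(listRecordsUrl("lcbooks", "oai_dc", "2005-01-01"));
        // "YOUR-SUBSCRIPTION-ID" is a placeholder for the key you receive at registration.
        System.out.println(itemSearchUrl("YOUR-SUBSCRIPTION-ID", "War and Peace", "Leo Tolstoy"));
    }
}
```

Pasting the printed URLs into a browser or oXygen is a quick way to inspect the responses before you write any HTTP or XSLT code.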
What you should turn in
You should submit, via CMS, a single zip file of
your eclipse project directory. This project directory should include:
- Java source file
- XML schema for your metadata format
- XSLT document to transform from OAI-PMH response to your metadata format.
- XSLT document to process the integration of Amazon data and LC harvest
results (optional, you may do this in-line in your java program using DOM)
- The stage1.xml intermediate file
- The stage2.xml final file.
Your project should be configured so that we can run your program.
Please ensure that library linkages are defined in a manner that makes your
project runnable on another machine.
What you will be graded on
Objective criteria for grading:
- well-formedness of XML documents
- validity of instance documents to schema
- ability to run your code and produce output as specified
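One way to check the first two criteria yourself before submitting is to run both output files through a validating parser. The following is a minimal sketch using the standard javax.xml.validation API; the schema file name schema.xsd and the class name are hypothetical, while stage1.xml and stage2.xml are the files named in the instructions.

```java
import java.io.File;
import java.io.IOException;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class ValidateStages {
    /** Returns null when the instance is well-formed and valid, else the first error message. */
    static String validate(Source schema, Source instance) throws IOException {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Validator validator = factory.newSchema(schema).newValidator();
            validator.validate(instance); // throws SAXException on the first error
            return null;
        } catch (SAXException e) {
            return String.valueOf(e.getMessage());
        }
    }

    public static void main(String[] args) throws IOException {
        // "schema.xsd" is a hypothetical name for your own schema file.
        for (String name : new String[] {"stage1.xml", "stage2.xml"}) {
            String error = validate(new StreamSource(new File("schema.xsd")),
                                    new StreamSource(new File(name)));
            System.out.println(name + ": " + (error == null ? "valid" : "INVALID: " + error));
        }
    }
}
```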
Subjective criteria for grading:
- Demonstration of logical design decisions in schema
- Demonstration of FRBR and DC principles
- Professionalism in project assembly and packaging
Resources
[1] W3 Schools XML Schema Tutorial -
http://www.w3schools.com/schema/schema_howto.asp
[2] W3 Schools XPath Tutorial -
http://www.w3schools.com/xpath/default.asp
[3] TopXML XSLT Tutorial -
http://www.topxml.com/xsl/tutorials/intro/
[4] Guidelines for implementing Dublin Core in XML-
http://dublincore.org/documents/dc-xml-guidelines/
[5] Amazon E-Commerce Service 4.0 -
http://www.amazon.com/gp/aws/sdk/104-8848828-4414330?
[6] oXygen XML Editor -
http://www.oxygenxml.com
[7] Dublin Core Element Set -
http://www.dublincore.org/documents/dces/
[8] IFLA Functional Requirements for Bibliographic Records -
http://www.ifla.org/VII/s13/frbr/frbr.pdf
[9] Saxon XSLT and XQuery Processor -
http://saxon.sourceforge.net/
[10] XML Schema Primer -
http://www.w3.org/TR/xmlschema-0/
[11] dom4j Open Source XML framework for Java -
http://www.dom4j.org/
[12] Open Archives Initiative Protocol for Metadata Harvesting -
http://www.openarchives.org/OAI/openarchivesprotocol.html
[13] Apache Jakarta HTTP Client -
http://jakarta.apache.org/commons/httpclient/
Updates (March 10, 2006)
- A schema that defines the Dublin Core elements, for use in your schema,
is available at
http://dublincore.org/schemas/xmls/simpledc20021212.xsd. More
information on XML schema and DC is at
http://dublincore.org/schemas/xmls/qdc/2003/04/02/notes/.
- You might want to take a look at the schemata in the OAI-PMH
specification to understand type reuse and import in XML schema, especially
as related to Dublin Core.
- You will note that reuse of Amazon item type requires the appearance of
ASIN in your instance document, due to the nature of the globally exposed
type in the ECS schema at
http://webservices.amazon.com/AWSECommerceService/AWSECommerceService.xsd.
You will NOT be penalized, therefore, for the ASIN value appearing multiple
times in your FRBR hierarchy.
Updates (March 13, 2006)
1. When you define your schema you should consider the fact that a metadata
element is not unique to one FRBR entity. For example, a Work with multiple
manifestations may have a title for the work (e.g., "War and Peace") and titles
for the manifestations (e.g., "War and Peace: The Movie" and "War and Peace: The
TV Show").
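To make the "War and Peace" case concrete, here is a minimal sketch that builds such a record with the DOM API, attaching a dc:title at more than one FRBR level. The FRBR namespace URI is the one given in the assignment and the DC namespace is the standard Dublin Core element set namespace; the element names frbr:work, frbr:expression, and frbr:manifestation and the nesting are assumptions for illustration only, since designing the actual structure is your job.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class TitlesPerEntity {
    static final String FRBR_NS = "http://www.ifla.org/frbr#";      // from the assignment
    static final String DC_NS = "http://purl.org/dc/elements/1.1/"; // Dublin Core elements

    /** Creates a FRBR entity element (hypothetical names) with an optional dc:title child. */
    static Element entity(Document doc, String localName, String title) {
        Element entity = doc.createElementNS(FRBR_NS, "frbr:" + localName);
        if (title != null) {
            Element dcTitle = doc.createElementNS(DC_NS, "dc:title");
            dcTitle.setTextContent(title);
            entity.appendChild(dcTitle);
        }
        return entity;
    }

    /** Builds one work whose manifestations carry their own titles. */
    static Document build() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().newDocument();
        Element work = entity(doc, "work", "War and Peace");
        doc.appendChild(work);
        Element expression = entity(doc, "expression", null); // no title at this level here
        work.appendChild(expression);
        expression.appendChild(entity(doc, "manifestation", "War and Peace: The Movie"));
        expression.appendChild(entity(doc, "manifestation", "War and Peace: The TV Show"));
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // Serialize the tree with an identity transform.
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(build()), new StreamResult(System.out));
    }
}
```

The same appendChild calls are the kind of tree insertion the detailed instructions ask for when you merge the ECS data, as opposed to editing the XML as text.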
Due: May 15, 11:59 PM
Overview
Congratulations! Your managers at amazon.com were impressed by your
metadata design and XML translation work. You have been promoted to
Ontology and Data Model Architect for the next generation of Amazon.
Since Amazon's launch in 1995, the complexity of its business and the
information it manages has grown immensely. From selling just books,
Amazon now is not only a shopping center for a diverse set of products, but is
also a collaborative space that relates products, consumers, and producers in
many ways. Amazon's rapid growth has led to a situation where the entities
and their relationships that are shown on the amazon.com pages are managed in
an ad-hoc fashion. Your managers are intrigued by a briefing you gave them
on semantic web technology. They would like you to prototype additional work in this
area to demonstrate the possibility of a complete redesign of the Amazon
back-end based on these technologies. Specifically, they would like you to:
- Develop an initial ontology in OWL that provides a meta-model for the
entities and relationships within amazon.com. Remember, this is
prototype work and you don't need to model the entire information space!
But, your ontology should at a minimum include the following notions:
- Agents and their sub-types: people and organizations who create and
do things. Some examples of agents are authors, musicians,
publishers, reviewers
- Products and their sub-types: the stuff that Amazon sells. For
this prototype you can limit your sub-types to products that are
intellectual content such as books, DVDs, music, and the like
- Lists and their sub-types: the various aggregations shown on
amazon.com. This includes lists of similar products, ListMania
lists, etc.
- Model an instance of your ontology using Fedora and Amazon ECS.
Your resulting Fedora implementation should have the following
characteristics:
- It should contain digital objects for:
- Classes, and sub-classes, defined in your ontology.
- Instances of classes (books, reviewers, etc.) defined in your
ontology. These instance digital objects should correspond to
items in amazon.com accessible via ECS "lookup" operations. Your
repository should include a subset of the entities on two amazon.com web
pages for products in two genres by one creator. An example is the
author and musician James McBride who has a wonderful book "The
Color of Water" and a nice Jazz CD "Process 1". You do NOT need to
instantiate every list item, reviewer,
etc. on the pages you choose - just create enough digital objects to
demonstrate your ontology concepts and their relationships.
- It should define relationships among the digital objects including:
- The sub-class relationships between classes, using the
rdfs:subClassOf property.
- The relationships among the various amazon.com items, using the
properties in your ontology.
- The assignment of your amazon.com entities to appropriate classes, using
the rdf:type property.
- The digital objects for the two products should produce xHTML
disseminations that are based on output from the corresponding ECS request and queries to the Fedora
relationship index that return relationships among digital objects.
For example, using the James McBride example again, your digital object
corresponding to "The Color of Water" could disseminate a web page
giving a cover page for the book displaying some metadata and then links
to reviews, lists, etc. In effect, your
Fedora repository should produce a new amazon.com web site using the
disseminations from the Fedora repository and the ECS calls as a basis.
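As a rough illustration of the relationships listed above, the sketch below builds the RDF/XML for a single statement, such as a class being a sub-class of another or an item having a class as its type. The rdf and rdfs namespace URIs are the standard ones. The overall shape (an rdf:Description whose rdf:about is an info:fedora/ URI) follows my reading of the Fedora relationship documentation and should be checked against it; the PIDs netid:Book, netid:Product, and netid:ColorOfWater and the class name are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class RelsExtSketch {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#";

    /** Builds RDF/XML stating: subject object --property--> target object. */
    static Document relsExt(String subjectPid, String propertyNs,
                            String propertyQName, String objectPid) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().newDocument();

        Element rdf = doc.createElementNS(RDF_NS, "rdf:RDF");
        doc.appendChild(rdf);

        // The digital object the statement is about.
        Element description = doc.createElementNS(RDF_NS, "rdf:Description");
        description.setAttributeNS(RDF_NS, "rdf:about", "info:fedora/" + subjectPid);
        rdf.appendChild(description);

        // The property element pointing at the related digital object.
        Element property = doc.createElementNS(propertyNs, propertyQName);
        property.setAttributeNS(RDF_NS, "rdf:resource", "info:fedora/" + objectPid);
        description.appendChild(property);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical PIDs: a class/sub-class statement and a typing statement.
        Document subClass = relsExt("netid:Book", RDFS_NS, "rdfs:subClassOf", "netid:Product");
        Document typing = relsExt("netid:ColorOfWater", RDF_NS, "rdf:type", "netid:Book");
        for (Document doc : new Document[] {subClass, typing}) {
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(System.out));
            System.out.println();
        }
    }
}
```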
Detailed Instructions
- Review the specifications for the Amazon E-Commerce Service (ECS)
[5]. Experiment with ItemLookup, CustomerContentLookup, and ListLookup
operations to understand their output. You will find that oXygen
[6] is a big help for examining the structure of the XML responses to these
calls.
- Install Fedora 2.1.1 [1][2]. The software will install quite easily
on Windows XP, Mac OSX, or various flavors of Linux. The easiest
installation configuration is to use the mckoi database, which comes
packaged with Fedora. Note that when you run the fedora-setup
utility as described in the installation instructions, you should use the
"no-ssl-authenticate-apim" setting.
- Before starting Fedora, edit the config file at <fedorahome>/server/config/fedora.fcfg.
- Find the setting for the pidNameSpace parameter and change its value from "changeme"
to the netid of your project leader. In the next line in the config
file (where retainPIDs is set), also change the "changeme" token to this new
value. You MUST do this so we will be able to grade your project!!
- Change the "level" of the fedora.server.resourceIndex.ResourceIndex
module from "0" to "1" (WARNING! Your project will not work if you fail to
do this!)
- Start up Fedora and run through the Fedora tutorial [3]. After
finishing this tutorial, you should be comfortable with basic Fedora
concepts needed for the assignment.
- Download the Fedora Image Collection Demo [4] to help you understand how
to encode relationships in Fedora and embed queries to the relationship
index in datastreams, and use them in disseminators. You might want to experiment with the Fedora
Resource Index Search Service [7][8] to further understand how this is done.
- If you are using your own machine, download the Protege Ontology Editor
(you should install version 3.2 beta, with all plug-ins).
- Complete information about Protege and OWL is available in the Protege
Owl Tutorial [11]. You shouldn't need to run through this entire
tutorial, since the course lecture should provide you with sufficient
background.
- Design your Amazon ontology as a Protege "OWL Files" Project. As
stated above you do not need to model the entire Amazon information space,
but do need to include concepts like Agents, Products, and Lists.
- Download the Racer reasoner [10] to test that your resulting ontology is
consistent. (Note that Racer runs on port 8080 and will not run concurrently
with Fedora).
- Pick two Amazon web pages, related to each other by the same creator;
e.g., the James McBride example mentioned above. These pages will
provide the basis of your Fedora implementation.
- Download and import the following two digital objects, which you will
use in your project:
- http://www.cs.cornell.edu/courses/cs431/2006sp/Projects/Project2/Proj2Bdef.xml
- This is a BDef with one operation, Query.
- http://www.cs.cornell.edu/courses/cs431/2006sp/Projects/Project2/Proj2Bmech.xml
- This is a BMech, refining the above BDef. The specification of the
BMech is as follows:
- It takes one datastream parameter, with MIME type text/plain, that is an
ITQL query to the resource index.
- It returns a SPARQL XML document that is the response to the query.
- Create the digital objects that demonstrate your model using the Fedora
administrative interface. Note that you
could also use an XML editor to create the raw FOXML objects and ingest
them, but this is probably harder than creating them through the UI. Do
NOT spend the time creating objects for all the entities on your selected
Amazon web pages. You should only create enough objects to demonstrate
the concepts in your ontology. So for example, you only need to select
a couple of reviews and list items from your pages. Your digital
object design should incorporate the following features:
- Digital objects corresponding to classes and sub-classes in your
ontology should include:
- Dublin Core descriptions where the title is the name of the class, type
is "owl:class", and identifier is the URI of the class. You MUST fill
in this Dublin Core metadata in this manner to help with our grading.
- RELS-EXT rdf fragments expressing class/subclass relationships (using
rdfs:subClassOf).
- Digital objects corresponding to Amazon items should include:
- Dublin Core descriptions where the title is the name of the item, the
creator is the name of your project group leader, the type is the URI of the
respective class in your ontology, and in the case of products the Dublin
Core identifier is the URL of the Amazon
page corresponding to the item. You MUST fill in this Dublin Core
metadata in this manner to help with our grading.
- A datastream that is a redirect to the amazon.com ECS "lookup" call
for the item, with ResponseGroup set to medium for products and small for
other Amazon entities.
- RELS-EXT rdf fragments connecting the item to its class digital object
(using rdf:type) and connecting the item to related items using properties
in your ontology (e.g. connecting reviews to an item).
- Digital objects corresponding to Amazon items that are products (e.g.,
the book, DVD, CD, etc.) should have the following characteristics:
- They should have a datastream that is the query input for the Proj2Bmech
that you imported in an earlier step. This query returns the PIDs of
the various items (author, reviews, etc.) related to this product.
- They should have a disseminator that employs the Proj2Bmech and consumes
the query input datastream defined above.
- They should have an XSL datastream that prettyprints in xHTML the Amazon
ECS response (don't go overboard with this). The xHTML should also contain a
link to the dissemination produced by Proj2Bmech, providing the linkages of
this product to reviews, etc. (Clearly this link will produce only the
SPARQL xml result, but you could imagine using this as the input for
additional xsl that prettyprinted this).
- They should have a disseminator that employs the built-in Fedora saxon
service to consume this XSL datastream and the Amazon ECS response
datastream to produce the xHTML (see section 7.1 of the Fedora tutorial
[3]).
- Export the two query input datastreams, which are included in your
product Digital Objects. The exported files should be named <pid>.txt,
where <pid> is the PID of the respective digital object.
- Export your FOXML digital objects upon completion. (IMPORTANT!!!
When you export your objects, make sure that your "export CONTEXT" is
"archive"!!!)
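Before you store a query in a datastream, it can help to try it by hand against the Resource Index Search Service [7]. The sketch below composes an iTQL query for the items that point at a product and the URL that would submit it. Treat every specific here as an assumption to verify against the Fedora documentation: the local address and risearch path, the type, lang, format, and query parameters, the query syntax, and the property URI and PID, which are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class RiSearchSketch {
    // ASSUMPTION: default local Fedora address and search path; see [7].
    static final String RISEARCH = "http://localhost:8080/fedora/risearch";

    /** iTQL query for every object that points at the product via the given property. */
    static String relatedItemsQuery(String productPid, String propertyUri) {
        return "select $item from <#ri> where $item <" + propertyUri
                + "> <info:fedora/" + productPid + ">";
    }

    /** URL that asks the search service for tuples in SPARQL XML result format. */
    static String searchUrl(String itql) throws UnsupportedEncodingException {
        return RISEARCH + "?type=tuples&lang=itql&format=Sparql&query="
                + URLEncoder.encode(itql, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Hypothetical PID and ontology property, for illustration only.
        String query = relatedItemsQuery("netid:ColorOfWater",
                "http://example.org/amazon-ontology#reviews");
        System.out.println(query);            // text you would place in the query datastream
        System.out.println(searchUrl(query)); // URL for trying the query by hand
    }
}
```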
What you should turn in
You should submit, via the course Blackboard web site, a single zip file,
which should include:
- The OWL/XML file for your ontology.
- Your exported FOXML objects for your Fedora repository
- Your query input data streams produced in step 13 above.
What you will be graded on
You will be graded on:
- Ontology Design: completeness and validity.
- Fedora digital object design: correspondence to ontology, completeness,
ability to execute as specified
- Professionalism
You will NOT be graded on:
- Quantity of amazon products represented in your repository.
- Aesthetics of output beyond simple professionalism.
Project 2 Resources
[1] Fedora home page -
http://www.fedora.info.
[2] Fedora installation and configuration guide -
http://fedora.info/download/2.1.1/userdocs/distribution/installation.html
[3] Fedora Tutorial -
http://www.fedora.info/download/2.1/userdocs/tutorials/tutorial2.pdf
[4] Fedora Demo Documentation -
http://www.fedora.info/download/2.1.1/userdocs/distribution/demos.html
[5] Amazon Web Services -
http://www.amazon.com/gp/browse.html/103-7308336-1981455?node=3435361&
[6] oXygen XML Editor -
http://www.oxygenxml.com
[7] Fedora Resource Index Search Service -
http://www.fedora.info/download/2.1/userdocs/server/webservices/risearch/index.html
[8] Fedora Digital Object Relationships -
http://www.fedora.info/download/2.1.1/userdocs/digitalobjects/introRelsExt.html
[9] Protege Ontology Editor -
http://protege.stanford.edu/
[10] Racer Reasoner -
http://www.racer-systems.com/index.phtml
[11] Protege Owl Tutorial -
http://www.co-ode.org/resources/tutorials/ProtegeOWLTutorial.pdf
To make it easier for us to grade your project you MUST do the following:
- Download the zip file from
http://www.cs.cornell.edu/courses/cs431/2006sp/Projects/Project2/ExportAsArchiveBinaryPatch.zip.
- Extract the zip file into your FEDORA_HOME directory (in Windows this
will probably be c:\fedora-2.1.1). This will replace 11 files in
your fedora binary distribution (you will be asked 11 times if you want to
replace an existing file, to which you should reply "yes"). This will
update your Fedora client and server. MAKE SURE YOUR FEDORA SERVER IS
STOPPED WHEN DOING THIS UPDATE!
- In step 14 of the detailed instructions, when you export your FOXML digital objects upon completion,
make sure that your "export CONTEXT" is "archive"!!!