CS 431: Architecture of Web Information Systems, Spring 2005

CS 431
Architecture of Web Information Systems
Spring 2005

Projects

Programming Projects

The projects are designed to give students the chance to build information environments using tools, techniques, and mechanisms covered in the lectures. In general, the assignments will require students to do some design work, understand relevant protocol or specification documents, and write a moderate amount of Java and XSLT code.

Students will work in groups of two on these assignments. There will be an opportunity for group formation at the beginning of the semester and groups will remain together for the remainder of the semester. Members of the group are expected to share information and design ideas, jointly understand protocol documents and APIs, and write the final code product. Groups must work independently of each other and any evidence of copying of ideas, designs, code, etc. will be considered an academic integrity violation. Grades will be awarded based on the final product of the group and each student's contribution to the work of the group.

Prerequisites

The assignments assume that students can program in Java and understand how to download and use class libraries. No Java or programming tutorials will be offered. Assignments will also depend on XSLT coding, which will be introduced in lecture. However, there will not be detailed XSLT programming tutorials. Students are expected to use online materials or available books for this.

Grading Criteria

This is not a programming course. Imaginative algorithms or data structures will not be required or play a role in grading. Instead, grading will be based on completion of the assigned task, demonstrated understanding of the concepts and protocols underlying the assignment, and project design. Nevertheless, assignments should demonstrate good programming practices and documentation commensurate with the 400 level of this course.

Programming Environment

Programming assignments should be done using the Eclipse IDE. This is available for free from http://www.eclipse.org for all major operating systems. Submissions will be in the form of Eclipse projects.

Tools

Working with XML, XSLT, and the like is considerably easier if you don't have to worry about syntactic details. I highly recommend that you use oXygen, a very nice environment for working in the XML/XSLT world. It is available for a 30-day free trial and has a very attractive academic license fee. It also integrates as a plug-in for Eclipse. oXygen runs on Windows, Mac OS X, and Linux.

Submitting Assignments

All assignments are due as listed on the course calendar. NO LATE ASSIGNMENTS WILL BE ACCEPTED.

To identify your assignments and make grading easier, assignments MUST conform to the following guidelines:

Each group should identify a group leader when it forms. That person's name will serve as the "firstlast" in the remainder of these instructions.
The Eclipse project must be named as firstlastassignment# (e.g., CarlLagozeAssignment1)
The first executable line of the program should be System.out.println("TeamMember1, TeamMember2");
The project directory should be submitted as one ZIP file that must be named firstlastassignment#.zip (e.g., CarlLagozeAssignment1.zip)
The assignment should be submitted via the submission links below.

Submissions that fail to conform to these guidelines will be rejected.
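As a minimal sketch of these guidelines, a program skeleton might look like the following. The class name simply mirrors the example project name above, and the team member names are the placeholders from the instructions; both are illustrative, not required values.

```java
// Hypothetical skeleton; replace the placeholder names with your own.
class CarlLagozeAssignment1 {
    public static void main(String[] args) {
        // Required: the first executable line prints the team members.
        System.out.println("TeamMember1, TeamMember2");

        // Assignment logic would follow here.
    }
}
```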

Project 1

Due: March 18, 11:55 PM

Overview

Based on your A+ grade in CS431, you have been hired by amazon.com to design and demonstrate a new metadata format for intellectual property items (books, music, DVDs, video, etc.) in their store. This metadata format will form the basis of their web site and web services. While they already have an existing XML-based metadata format, which is exposed via their web services API, they want to align with external standards such as Dublin Core and FRBR. They are especially interested in the FRBR data model, because it will give customers a better tool for browsing among related products. Your managers want a presentation of your work that includes a formal definition of your metadata model in the form of an XML schema, a way to programmatically transform (via XSLT) their existing XML metadata format for items into your proposed format, and a way to programmatically transform (via XSLT) from your proposed format into an HTML human-readable display of the data. A career of fame and fortune in the e-commerce industry awaits you if you can demonstrate your hard-earned skills to management.

Resources

[1] W3 Schools XML Schema Tutorial - http://www.w3schools.com/schema/schema_howto.asp

[2] W3 Schools XPath Tutorial - http://www.w3schools.com/xpath/default.asp

[3] TopXML XSLT Tutorial - http://www.topxml.com/xsl/tutorials/intro/

[4] XML Schema for Simple Dublin Core - http://dublincore.org/schemas/xmls/simpledc20021212.xsd

[5] Amazon E-Commerce Service 4.0 - http://www.amazon.com/gp/aws/sdk/104-8848828-4414330?

[6] oXygen XML Editor - http://www.oxygenxml.com

[7] Dublin Core Element Set - http://www.dublincore.org/documents/dces/

[8] IFLA Functional Requirements for Bibliographic Records - http://www.ifla.org/VII/s13/frbr/frbr.pdf

[9] Saxon XSLT and XQuery Processor - http://saxon.sourceforge.net/

[10] XML Schema Primer - http://www.w3.org/TR/xmlschema-0/

[11] dom4j Open Source XML framework for Java - http://www.dom4j.org/

Detailed Instructions

Review the specifications for the Amazon E-Commerce Service (ECS) [5]. This extensive documentation includes instructions for programmatically searching the Amazon database for items, along with the specifications and schema for the XML-formatted result sets that are returned. You will need to register for use of ECS. Registration is free and is available via a link from http://www.amazon.com/gp/browse.html/002-3738912-9258424?node=3435361. Experiment with the REST-based requests (you can ignore the SOAP calls for now), in particular ItemSearch, and examine the returned XML to understand the data supplied by Amazon for its items. You will find that oXygen [6] is a big help for examining the structure of XML files and the relationship between schemas and their instances.
Review the Dublin Core element set [7] and XML schema for Simple Dublin Core [4] to understand possible mappings between item metadata descriptions supplied by ECS and DC.
Review the IFLA FRBR entities (in particular Work, Expression, Manifestation, and Item) and their attributes, paying attention to the placement of DC elements and ECS item metadata elements in the FRBR model.
Design an XML schema for a metadata format that can be used for Amazon intellectual property items (books, music, DVDs, videos, etc.). Use the following guidelines for the design of your metadata format:
- It should be based on the FRBR work, expression, manifestation, item hierarchy. As such, it should be the basis of XML instance documents, each of which represents a single work with its associated sub-entities nested within it. All your metadata properties should be contained within the appropriate FRBR entity.
- It should be based on the metadata available from ECS for Amazon intellectual property products. So, even though there are an infinite number of possible item attributes you could express in your metadata, take a look at what ECS supplies and think of how that information semantically fits within your metadata schema.
- It should use as many of the DC elements as possible and import the DC XML schema [4] to accomplish that. There are some pretty clear mappings between the elements in the ECS schema and DC.
- Your schema should make use of namespaces. It should include three separate namespaces: 1) for elements in the DC namespace, 2) for elements based on the FRBR model (use a namespace URI of http://www.ifla.org/frbr#), and 3) for elements that you create (you can formulate your own namespace URI).
- Use comments within your schema so we understand the reasons for your design decisions.
Write an XSLT document to transform the XML data returned from an ECS ItemSearch request into your metadata format. Note that each ECS ItemSearch call returns a result set with a number of product items in it. The REST API allows you to page through a full result set if you wish. For the purpose of this assignment, you should only process the first result page.
Write an XSLT document to transform your XML metadata format into a human-viewable XHTML document. There is no substantial reward for fancy aesthetics here - that is, get the information out in human-readable form and save your artistic inclinations for a web page design course.
Write a Java program that issues an ECS ItemSearch REST request based on some search criteria that will produce several product items that you can use as the basis for demonstrating your FRBR and DC based metadata format. For example, a search for "Shrek" produces several items in various genres that may be related to the work titled "Shrek". The program should then use XSLT (via Saxon [9], dom4j [11], or another library) to transform the ECS output to an instance of metadata that is valid according to your schema and transform your metadata instance into a human-viewable web page. Thus, each run of your program should:
- Issue one (or several) ECS requests.
- Use two XSLT documents
  - transform ECS to your metadata instance
  - transform your metadata instance to XHTML
- Output two XML files:
  - a metadata instance that is valid according to your XML schema
  - an XHTML file that is valid according to the XHTML DTD.
Note: the quantity of Amazon products included in your metadata instance document is NOT a factor in grading criteria. That is, if you pick search criteria that produce lots of item results (such as "War and Peace"), you don't have to go through all the results pages. Try to come up with a search that produces items in multiple genres on the first page.
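The two-stage transformation described above can be sketched with the standard JAXP API (javax.xml.transform), which Saxon also implements. This is only an illustration under stated assumptions: the input document, its urn:example:ecs namespace, and the frbr:work and frbr:manifestation element names are hypothetical stand-ins, not the real ECS response format or the course FRBR schema. A real program would read the ECS response from the REST call and write both results to files.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TwoStageTransformSketch {
    static final String XSL = "http://www.w3.org/1999/XSL/Transform";
    static final String FRBR = "http://www.ifla.org/frbr#";
    static final String DC = "http://purl.org/dc/elements/1.1/";

    // Stand-in for an ECS response; namespace and element names are hypothetical.
    static final String SAMPLE_INPUT =
        "<ItemSearchResponse xmlns='urn:example:ecs'>"
      + "<Item><Title>Shrek</Title></Item>"
      + "<Item><Title>Shrek 2</Title></Item>"
      + "</ItemSearchResponse>";

    // Stage 1: stand-in ECS format to a (hypothetical) FRBR/DC metadata format.
    static final String TO_METADATA =
        "<xsl:stylesheet version='1.0' xmlns:xsl='" + XSL + "'"
      + " xmlns:ecs='urn:example:ecs' xmlns:frbr='" + FRBR + "' xmlns:dc='" + DC + "'"
      + " exclude-result-prefixes='ecs'>"
      + "<xsl:template match='/'>"
      + "<frbr:work>"
      + "<xsl:for-each select='//ecs:Item'>"
      + "<frbr:manifestation><dc:title><xsl:value-of select='ecs:Title'/></dc:title>"
      + "</frbr:manifestation>"
      + "</xsl:for-each>"
      + "</frbr:work>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Stage 2: metadata format to a bare-bones XHTML list of titles.
    static final String TO_XHTML =
        "<xsl:stylesheet version='1.0' xmlns:xsl='" + XSL + "'"
      + " xmlns:dc='" + DC + "' xmlns='http://www.w3.org/1999/xhtml'"
      + " exclude-result-prefixes='dc'>"
      + "<xsl:template match='/'>"
      + "<html><head><title>Items</title></head><body><ul>"
      + "<xsl:for-each select='//dc:title'><li><xsl:value-of select='.'/></li></xsl:for-each>"
      + "</ul></body></html>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies one stylesheet to one document, both given as strings.
    static String transform(String xml, String xslt) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String metadata = transform(SAMPLE_INPUT, TO_METADATA);
        String xhtml = transform(metadata, TO_XHTML);
        System.out.println(metadata);
        System.out.println(xhtml);
    }
}
```

Keeping the two stylesheets separate means the intermediate metadata instance can be saved and validated on its own before the XHTML is produced.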

What you should turn in

You should submit, via the course Blackboard web site, a single zip file of your Eclipse project directory. This project directory should include:

Java source file
XML schema for your metadata format
XSLT document to transform from ECS to your metadata format.
XSLT document to transform from an instance of your metadata format to XHTML
XML instance of your metadata format
XHTML instance of human view of your metadata format.

What you will be graded on

Objective criteria for grading:

well-formedness of XML documents
validity of instance documents against their schema and DTDs
ability to run your code and produce output as specified
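One way to check schema validity yourself before submitting is the standard javax.xml.validation API. The sketch below is an assumption-laden illustration: the tiny schema and the work/title element names are hypothetical stand-ins for your own metadata schema, and it covers XML Schema validation only, not validation of XHTML against its DTD.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class SchemaCheckSketch {
    // Hypothetical stand-in schema: a work element containing exactly one title.
    static final String SCHEMA =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='work'>"
      + "<xs:complexType><xs:sequence>"
      + "<xs:element name='title' type='xs:string'/>"
      + "</xs:sequence></xs:complexType>"
      + "</xs:element>"
      + "</xs:schema>";

    // Returns true if xml is well-formed and valid against xsd. Returns false
    // if the document is malformed or invalid, or if the schema cannot be compiled.
    static boolean isValid(String xml, String xsd) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<work><title>Shrek</title></work>", SCHEMA));
        System.out.println(isValid("<work><name>Shrek</name></work>", SCHEMA));
    }
}
```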

Subjective criteria for grading:

Demonstration of logical design decisions in schema
Demonstration of FRBR and DC principles
Professionalism in project assembly and packaging

Criteria NOT considered:

Quantity of Amazon products included in your metadata instance
Aesthetics of your produced web page (although outright unreadability of the page will be penalized)

Updates (Friday, March 11, 2005):

You must use three namespaces, and may optionally use a fourth. The three required namespaces are the Amazon namespace, the DC namespace, and the FRBR namespace.
The fourth, optional namespace is for any elements that you decide are necessary in the creation of your schema.
The Amazon namespace has a date stamped into it; don't worry about this. The date exists so that Amazon can release revisions of their namespace, which doesn't happen very often.
The XSD document describing the FRBR namespace can be found at http://www.cs.cornell.edu/~ags/documents/cs431_frbr_spr05.xsd.
You can specify a default action for XSLT by matching a template to the "*" pattern.
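As an illustration of the "*" tip, the sketch below runs a small stylesheet through the standard JAXP API. The input document and its element names are hypothetical and carry no namespace, unlike real ECS responses. The specific Title template takes precedence over the catch-all "*" template, which here simply descends into child elements and emits nothing itself. Without the catch-all, XSLT's built-in rules would copy the text of unmatched elements (such as the price) into the output.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class DefaultTemplateSketch {
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      // Specific rule: handle the element we care about.
      + "<xsl:template match='Title'><title><xsl:value-of select='.'/></title></xsl:template>"
      // Default action: for any other element, just visit its child elements.
      + "<xsl:template match='*'><xsl:apply-templates select='*'/></xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical input; element names are illustrative only.
    static final String INPUT = "<Item><Title>Shrek</Title><Price>9.99</Price></Item>";

    static String run(String xml, String xslt) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // The title is kept; the price text is dropped by the default action.
        System.out.println(run(INPUT, STYLESHEET));
    }
}
```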

Project 2

Due: May 13, 11:55 PM

Overview

Congratulations! Your managers at amazon.com were impressed by your metadata design and XML translation work. You have been promoted to Ontology and Data Model Architect for the next generation of Amazon.

Since Amazon's launch in 1995, the complexity of its business and the information it manages has grown immensely. Having started by selling just books, Amazon is now not only a shopping center for a diverse set of products but also a collaborative space that relates products, consumers, and producers in many ways. Amazon's rapid growth has led to a situation where the entities and relationships shown on the amazon.com pages are managed in an ad hoc fashion. Your managers are intrigued by a briefing you gave them on semantic web technology. They would like you to prototype additional work in this area to demonstrate the possibility of a complete redesign of the Amazon back-end based on these technologies. Specifically, they would like you to:

Develop an initial ontology in OWL that provides a meta-model for the entities and relationships within amazon.com. Remember, this is prototype work and you don't need to model the entire information space! But, your ontology should at a minimum include the following notions:
1. Agents and their sub-types: people and organizations who create and do things. Some examples of agents are authors, musicians, publishers, reviewers
2. Products and their sub-types: the stuff that Amazon sells. For this prototype you can limit your sub-types to products that are intellectual content such as books, DVDs, music, and the like
3. Lists and their sub-types: the various aggregations shown on amazon.com. This includes lists of similar products, ListMania lists, etc.
Model an instance of your ontology using Fedora and Amazon ECS. Your resulting Fedora implementation should have the following characteristics:
1. It should contain digital objects for:
  1. Classes, and sub-classes, defined in your ontology.
  2. Instances of classes (books, reviewers, etc.) defined in your ontology. These instance digital objects should correspond to items in amazon.com accessible via ECS "lookup" operations. Your repository should include a subset of the entities on two amazon.com web pages for products in two genres by one creator. An example is the author and musician James McBride, who has a wonderful book "The Color of Water" and a nice Jazz CD "Process 1". You do NOT need to instantiate every list item, reviewer, etc. on the pages you choose - just create enough digital objects to demonstrate your ontology concepts and their relationships.
2. It should define relationships among the digital objects including:
  1. The sub-class relationships between classes, using the rdfs:subClassOf property.
  2. The relationships among the various amazon.com items, using the properties in your ontology.
  3. The typing of your amazon.com entities to the appropriate classes, using the rdf:type property.
3. The digital objects for the two products should produce xHTML disseminations that are based on output from the corresponding ECS request and queries to the Fedora relationship index that return relationships among digital objects. For example, using the James McBride example again, your digital object corresponding to "The Color of Water" could disseminate a web page giving a cover page for the book displaying some metadata and then links to reviews, lists, etc. In effect, your Fedora repository should produce a new amazon.com web site using the disseminations from the Fedora repository and the ECS calls as a basis.
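In Fedora, relationships like these are recorded as RDF in an object's RELS-EXT datastream (see the Detailed Instructions). The sketch below builds such a fragment as a string, purely to show its shape. The PIDs are hypothetical; subClassOf is defined in the RDF Schema namespace, so the example uses the rdfs prefix. Whether you write this RDF by hand or through the Fedora administrative interface is up to you.

```java
class RelsExtSketch {
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String RDFS = "http://www.w3.org/2000/01/rdf-schema#";

    // Builds an RDF/XML fragment stating: subject --property--> object.
    // propertyQName must use the rdf or rdfs prefix declared below,
    // e.g. "rdfs:subClassOf" or "rdf:type".
    static String relsExt(String subjectPid, String propertyQName, String objectPid) {
        return "<rdf:RDF xmlns:rdf=\"" + RDF + "\" xmlns:rdfs=\"" + RDFS + "\">\n"
             + "  <rdf:Description rdf:about=\"info:fedora/" + subjectPid + "\">\n"
             + "    <" + propertyQName + " rdf:resource=\"info:fedora/" + objectPid + "\"/>\n"
             + "  </rdf:Description>\n"
             + "</rdf:RDF>";
    }

    public static void main(String[] args) {
        // Hypothetical PIDs: netid:2 (say, a Book class) is a sub-class of netid:1 (Product).
        System.out.println(relsExt("netid:2", "rdfs:subClassOf", "netid:1"));
        // Hypothetical PIDs: netid:10 (a particular book) is an instance of netid:2.
        System.out.println(relsExt("netid:10", "rdf:type", "netid:2"));
    }
}
```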

Resources

[1] Fedora home page - http://www.fedora.info.

[2] Fedora installation and configuration guide - http://www.fedora.info/download/2.0/userdocs/distribution/installation.html

[3] Fedora Tutorial 2 - http://www.fedora.info/download/2.0/userdocs/tutorials/tutorial2.pdf

[4] Fedora Demo Documentation - http://www.fedora.info/download/2.0/userdocs/distribution/demos.html

[5] Amazon E-Commerce Service 4.0 - http://www.amazon.com/gp/aws/sdk/104-8848828-4414330?

[6] oXygen XML Editor - http://www.oxygenxml.com

[7] Fedora Resource Index Search Service - http://www.fedora.info/download/2.0/userdocs/server/webservices/risearch/index.html

[8] Protege Ontology Editor - http://protege.stanford.edu/

[9] Racer Reasoner - http://www.cs.concordia.ca/~haarslev/racer/download.html

[10] Protege Owl Tutorial - http://www.co-ode.org/resources/tutorials/ProtegeOWLTutorial.pdf

Detailed Instructions

Review the specifications for the Amazon E-Commerce Service (ECS) [5]. Experiment with ItemLookup, CustomerContentLookup, and ListLookup operations to understand their output. You will find that oXygen [6] is a big help for examining the structure of the XML responses to these calls.
Install Fedora 2.0 [1][2]. The software will install quite easily on Windows XP, Mac OS X, or various flavors of Linux. The easiest installation configuration is to use the mckoi database, which comes packaged with Fedora. You shouldn't need to change any of the configuration defaults (paths, passwords, ports, etc.).
Before starting Fedora, edit the config file at <fedorahome>/server/config/fedora.fcfg. Find the setting for the pidNameSpace parameter and change its value from "changeme" to the netid of your project leader. In the next line in the config file (where retainPIDs is set), also change the "changeme" token to this new value. You MUST do this so we will be able to grade your project!!
Start up Fedora and run through the Fedora tutorial [3]. After finishing this tutorial, you should be comfortable with basic Fedora concepts needed for the assignment.
Download the Fedora Image Collection Demo [4] to help you understand how to encode relationships in Fedora and embed queries to the relationship index in datastreams, and use them in disseminators. You might want to experiment with the Fedora Resource Index Search Service [7] to further understand how this is done.
If you are using your own machine, download the Protege Ontology Editor (you should install version 3.1 beta, with all plug-ins).
Complete information about Protege and OWL is available in the Protege Owl Tutorial [10]. You shouldn't need to run through this entire tutorial, since the course lecture should provide you with sufficient background.
Design your Amazon ontology as a Protege "OWL Files" Project. As stated above you do not need to model the entire Amazon information space, but do need to include concepts like Agents, Products, and Lists.
Download the Racer reasoner [9] to test that your resulting ontology is consistent. (Note that Racer runs on port 8080 and will not run concurrently with Fedora).
Pick two Amazon web pages, related to each other by the same creator; e.g., the James McBride example mentioned above. These pages will provide the basis of your Fedora implementation.
Download and import the following two digital objects, which you will use in your project:
1. http://www.cs.cornell.edu/courses/cs431/2005sp/Projects/Project2/Proj2Bdef.xml - This is BDef with one operation, Query.
2. http://www.cs.cornell.edu/courses/cs431/2005sp/Projects/Project2/Proj2Bmech.xml - This is a BMech, refining the above BDef. The specification of the BMech is as follows:
  1. It takes one datastream parameter, with MIME type text/plain, that is an iTQL query to the resource index.
  2. It returns a SPARQL XML document that is the response to the query.
Create the digital objects to demonstrate your model using the Fedora administrative interface. Note that you could also use an XML editor to create the raw FOXML objects and ingest them, but this is probably harder than creating them through the UI. Do NOT spend the time creating objects for all the entities on your selected Amazon web pages. You should only create enough objects to demonstrate the concepts in your ontology. So, for example, you only need to select a couple of reviews and list items from your pages. Your digital object design should incorporate the following features:
1. Digital objects corresponding to classes and sub-classes in your ontology should include:
  1. Dublin Core descriptions where the title is the name of the class, type is "owl:class", and identifier is the URI of the class. You MUST fill in this Dublin Core metadata in this manner to help with our grading.
  2. RELS-EXT RDF fragments expressing class/subclass relationships (using rdfs:subClassOf).
2. Digital objects corresponding to Amazon items should include:
  1. Dublin Core descriptions where the title is the name of the item, the creator is the name of your project group leader, the type is the URI of the respective class in your ontology, and, in the case of products, the Dublin Core identifier is the URL of the Amazon page corresponding to the item. You MUST fill in this Dublin Core metadata in this manner to help with our grading.
  2. A datastream that is a redirect to the amazon.com ECS "lookup" call for the item, with ResponseGroup set to medium for products and small for other Amazon entities.
  3. RELS-EXT rdf fragments connecting the item to its class digital object (using rdf:type) and connecting the item to related items using properties in your ontology (e.g. connecting reviews to an item).
3. Digital objects corresponding to Amazon items that are products (e.g., the book, DVD, CD, etc.) should have the following characteristics:
  1. They should have a datastream that is the query input for the Proj2Bmech that you imported in an earlier step. This query returns the PIDs of the various items (author, reviews, etc.) related to this product.
  2. They should have a disseminator that employs the Proj2Bmech and consumes the query input datastream defined above.
  3. They should have an XSL datastream that prettyprints in xHTML the Amazon ECS response (don't go overboard with this). The xHTML should also contain a link to the dissemination produced by Proj2Bmech, providing the linkages of this product to reviews, etc. (Clearly this link will produce only the SPARQL xml result, but you could imagine using this as the input for additional xsl that prettyprinted this).
  4. They should have a disseminator that employs the built-in Fedora saxon service to consume this XSL datastream and the Amazon ECS response datastream to produce the xHTML (see section 7.1 of the Fedora tutorial [3]).
Export the two query input datastreams, which are included in your product Digital Objects. The exported files should be named <pid>.txt, where <pid> is the PID of the respective digital object.
Export your FOXML digital objects upon completion.
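The query input datastream for a product holds a plain-text iTQL query. A minimal sketch of what such a query might look like is given below, built as a Java string only for illustration. The PID and the property URI are hypothetical; substitute the property URIs from your own ontology. The <#ri> model name follows the Fedora Resource Index Search Service documentation [7], which you should consult for the exact syntax supported by your installation.

```java
class ItqlQuerySketch {
    // Builds a query for every object that points at the given product
    // through the given property (for example, reviews of a book).
    static String relatedItemsQuery(String productPid, String propertyUri) {
        return "select $item from <#ri> where $item <" + propertyUri + "> "
             + "<info:fedora/" + productPid + ">";
    }

    public static void main(String[] args) {
        // Hypothetical PID and ontology property.
        System.out.println(
            relatedItemsQuery("netid:10", "http://example.org/amazon#reviews"));
    }
}
```

The text printed by this sketch is what would be stored in the datastream; the resource index returns matching objects as info:fedora/ URIs, from which the PIDs can be read.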

What you should turn in

You should submit, via the course Blackboard web site, a single zip file, which should include:

The OWL/XML file for your ontology.
Your exported FOXML objects for your Fedora repository
Your two query input datastreams, exported as <pid>.txt files as described in the Detailed Instructions above.

What you will be graded on

You will be graded on:

Ontology Design: completeness and validity.
Fedora digital object design: correspondence to ontology, completeness, ability to execute as specified
Professionalism

You will NOT be graded on:

Quantity of Amazon products represented in your repository.
Aesthetics of output beyond simple professionalism.

Carl Lagoze (lagoze@cs.cornell.edu)
Last changed: 02/20/2006