CS 431: Architecture of Web Information Systems, Spr. 2004

CS 431
Architecture of Web Information Systems
Spring 2004

Assignments

The reaction paper assignments are structured as follows: you should cover at least two closely related papers relevant to the current section of the course. One of the papers should be from the course syllabus (assigned for the discussion section at which the paper is due or for one of the two preceding sections). The other should be a related paper that you discover by another method, such as following references in the papers you have read, searching Google or ResearchIndex, using the library gateway, or consulting another information source. Think of finding this paper as a mini resource discovery exercise. You should then write approximately 3-4 pages (approximately 1500-2000 words) in which you address the following points:

What is the main content of the papers?
Why is it interesting in relation to the course, reflected in both readings and lecture?
What are the weaknesses of the papers, and how could they be improved?
What are some promising further research questions in the direction of the papers, and how could they be pursued?

Reaction papers should not just be summaries of the papers you read; most of your text should be focused on synthesis of the underlying ideas, your own perspective on the papers, and thinking on how the content of the papers relates to the overall content of the course. Reaction papers should be done individually (i.e. not in groups).

The reaction papers will be graded on a 12 point scale, with points allocated in the following categories:

Choice of papers (2 points) - Points will be awarded based on the scholarly nature of the second paper that is chosen and its relationship to the course content and to the paper selected from the syllabus.
Presentation (2 points) - Points will be awarded based on clarity in preparation and coherence of ideas presented.
Content understanding and summarization (4 points) - Points will be awarded based on the demonstrated understanding of the content of the two papers and the way in which that understanding reflects an understanding of the course content in general.
Synthesis (4 points) - Points will be awarded based on the depth of analysis of the relationship between the papers, critique of their content, and integration into the issues raised by the course in general.

Submission procedure for reaction papers is as follows:

A physical version of the paper should be handed in at discussion section on the due date.
An electronic copy of your physical submission should also be sent via email attachment and should be addressed to lagoze@cs.cornell.edu, ags@cs.cornell.edu. The subject of the email should be formatted as <your name>:Reaction:<due date>, as in 'Carl Lagoze:Reaction:2004-02-22'. The date/time stamp of this electronic submission will provide verification of your submission. This must be by the beginning of discussion section. Late submissions will not be accepted. Permitted formats are Word and PDF.
Please state at the beginning of the paper bibliographic references to the two papers discussed therein. You should format these references according to the IEEE reference formats at http://www.computer.org/author/style/refer.htm.

Reaction paper due dates are on the syllabus.

Programming Projects

Tentative due dates for the programming projects are April 5 and May 17.

The projects are designed to give students some practical experience in dealing with the technologies that make the Web and digital libraries work. In general, the assignments will require students to understand the relevant protocol or specification documents and write a moderate amount of Java code that demonstrates an understanding of those specifications.

These assignments are not mainly a test of your programming skills. Rather, they are meant to encourage you to read protocol specifications and understand the APIs that implement them. In the real world this is not done in isolation. Thus, students are expected to work in groups on these assignments. At the beginning of the semester the class will break up into groups of 2 that will remain together for the remainder of the semester. Members of the group are expected to share information, jointly understand protocol documents and APIs, and write the final code product. Grades will be awarded based on the final product of the group and each student's contribution to the work of the group.

Prerequisites

The assignments assume that students can program in Java and understand how to download and use class libraries. No Java or programming tutorials will be offered.

Grading Criteria

This is not a programming course. Imaginative algorithms or data structures will not be required or play a role in grading. Instead, grading will be based on completion of the assigned task and demonstrated understanding of the concepts and protocols underlying the assignment. Nevertheless, assignments should demonstrate good programming practices and documentation commensurate with the 400 level of this course.

Programming Environment

Programming assignments should be done using the Eclipse IDE. This is available for free from http://www.eclipse.org for all major operating systems. Submissions will be in the form of Eclipse projects.

Tools

Working with XML, XSLT, and the like is considerably easier if you don't have to worry about syntactic details. Fortunately, there are a number of excellent tools available to avoid this. Two that I recommend are:

XMLSpy. You may download it onto your personal machine for a 30-day free trial, which may be renewed. The purchase price is ridiculously expensive! This tool is available for Windows only.
Oxygen. Also available for a 30-day free trial. It has a very attractive academic license fee. It also integrates as a plug-in for Eclipse, and it runs on Windows, Mac OS X, and Linux!

Submitting Assignments

All assignments are due by 11:59PM on the due date. NO LATE ASSIGNMENTS WILL BE ACCEPTED.

To identify your assignments and make grading easier, assignments MUST conform to the following guidelines:

Each group should identify a group leader when it forms. That person's name will serve as the "firstlast" in the remainder of these instructions.
The Eclipse project must be named as firstlastassignment# (e.g., CarlLagozeAssignment1)
The first executable line of the program should be System.out.println("TeamMember1, TeamMember2, TeamMember3");
The assignment should be submitted to CMS at http://cms.csuglab.cornell.edu.

Submissions that fail to conform to these guidelines will be rejected.

Assignment 0 - Due 2/6/2004

The purpose of this assignment is to ensure that you are familiar with the assignment submission process. It will not be graded but your submission of it registers the existence of your project group.

Resources for assignment 0

Sample submission zip file - here.

Directions for assignment 0

Write a Java program that prints two lines to the console:

firstlastassignment0
The three names of group members separated by commas
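
As a rough illustration of the two-line output described above, here is a minimal sketch. The class name, the group leader name, and the member names are all placeholders (assumptions for illustration only); substitute your own.

```java
// Minimal sketch of an Assignment 0 program.
// All names below are placeholders, not real group data.
class Assignment0 {
    // Project name in the firstlastassignment# form, using a placeholder leader name.
    static final String PROJECT_NAME = "CarlLagozeAssignment0";

    // Names of the group members, separated by commas (placeholders).
    static final String GROUP_MEMBERS = "TeamMember1, TeamMember2, TeamMember3";

    public static void main(String[] args) {
        System.out.println(PROJECT_NAME);   // first line: project name
        System.out.println(GROUP_MEMBERS);  // second line: group members
    }
}
```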

Assignment 1 - Due 4/5/2004 (11:59PM)

In this assignment you will harvest Dublin Core metadata via the OAI-PMH, transform that metadata via XSLT to conform to a new FRBR-based schema that you design, and publish that metadata via RSS 1.0.

Resources for assignment 1

Jakarta Project HttpClient - a pure Java implementation of the HTTP protocol.
Open Archives Protocol for Metadata Harvesting - application-independent interoperability framework based on metadata harvesting.
Apache Xalan-Java - an XSLT processor for transforming XML documents (the distribution includes Apache Xerces for XML parsing).
HTTP 1.1 Specification - IETF RFC 2616.
W3C RDF Page - The root page for the W3C RDF and Semantic Web effort. Contains links to all RDF and related specifications. The primer is especially useful.
The FRODO RDFSViz Tool - provides a visualization service for ontologies represented in RDF Schema.
RSS Tutorial for Content Providers and Publishers - A useful starting point for RSS 1.0.
RDF Site Summary (RSS) 1.0 - Lots of RSS information including the specification.
RSS 1.0 Validator - A tool to help check the correctness of your RSS feed.

Directions for assignment 1

The first part of the assignment involves some modeling work based on Dublin Core and FRBR. The DC properties have been criticized because they are a simple flat list. Semantically the properties can be partitioned among the four entities in the IFLA FRBR entity model: work, expression, manifestation, and item. Write a new RDF schema (expressed in RDF/XML) that defines the four classes of resources described by FRBR, defines properties to associate the FRBR entities with the described resource, and then associates the respective Dublin Core properties with the proper FRBR entity via domain constraints. You should include the schema in your submitted zip file with the name dc_frbr.rdfs. You should include comments in your rdfs file sufficient to justify your modeling decisions.
Building on this modeling work, you should then write a single Java program that takes no arguments and does the following:
- Harvest metadata from baseURL http://services.nsdl.org:8080/nsdloai/OAI. You should restrict your harvest to the set 'arXiv:org' and metadata format 'nsdl_dc' and to records that are new since June 1, 2003. You can do a single harvest, ignoring the resumptionToken (indicating that there is another group of records to harvest for this request).
- Transform the harvested metadata into an RSS 1.0 channel that contains an item for each OAI record harvested and which translates the harvested metadata to conform to the new schema you designed in part 1.
- Write out the resulting RSS/XML channel as a file called RSS.xml.
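
The harvest step above comes down to one OAI-PMH ListRecords request. The following is a minimal sketch of how the request URL could be assembled, assuming the standard OAI-PMH argument names (verb, metadataPrefix, set, from) and day-granularity dates in YYYY-MM-DD form. The class and method names are hypothetical, and the sketch only builds the URL; actually issuing the request (for example with HttpClient) is left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch only: builds an OAI-PMH ListRecords request URL.
// Class and method names are illustrative, not part of any library.
class OaiRequest {
    // Assemble a selective-harvest URL from the base URL and the request arguments.
    static String listRecordsUrl(String baseUrl, String metadataPrefix, String set, String from) {
        return baseUrl
                + "?verb=ListRecords"
                + "&metadataPrefix=" + encode(metadataPrefix)
                + "&set=" + encode(set)
                + "&from=" + encode(from);
    }

    // Percent-encode a query value (for example, ':' becomes %3A).
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(listRecordsUrl(
                "http://services.nsdl.org:8080/nsdloai/OAI", "nsdl_dc", "arXiv:org", "2003-06-01"));
    }
}
```

The response to such a request is an XML document whose record elements carry the metadata. Because the assignment permits a single harvest, any resumptionToken in the response can be ignored.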

Guidance for assignment 1

This assignment really doesn't require a significant amount of programming. The bulk of the work is understanding the schema design, protocol specifications, APIs, and tools such as XSLT. Much of the material will be introduced in lecture over the next few weeks. I'd recommend, however, that you get an early start by looking at and downloading the relevant resources and experimenting with them. Before writing the XSLT transformation, I recommend manually (using Oxygen) writing a trial RSS 1.0 channel to see what you are headed towards.
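
To get a feel for driving an XSLT transformation from Java, here is a small sketch that uses the standard javax.xml.transform (JAXP) interfaces, which Xalan-Java implements. The input document and stylesheet are toy placeholders with no namespaces; they are not the harvested nsdl_dc format or a valid RSS 1.0 channel, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Sketch only: applies an XSLT stylesheet to an XML string via JAXP.
class XsltSketch {
    // Toy stylesheet: wraps the title of a <record> in an <item> element.
    static final String STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
          + "<xsl:template match='/record'>"
          + "<item><title><xsl:value-of select='title'/></title></item>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    // Run the stylesheet over the input and return the result as a string.
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<record><title>Example</title></record>", STYLESHEET));
    }
}
```

In the real assignment the stylesheet would live in its own file and would have to handle the namespaces of the harvested records and of RSS 1.0.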

Assignment 2 - Due 5/17/2004 (11:59PM)

In this assignment, you will integrate Fedora and Jena to provide a metadata repository for various entities and reflect the relationships among those entities in a Jena model. The entities (content) that you work with will be based on a simple model of information on Amazon.

Resources for assignment 2

Fedora project home page and download site - content management system for your metadata repository
Jena semantic web framework - software for manipulating RDF models
Representing vCard Objects in RDF/XML - instructions on how to express agent/person information in RDF.
IsaViz: A Visual Authoring Tool for RDF
JDOM: An easy DOM-based library for building and manipulating XML documents.

Directions for assignment 2

Pick a person who is the creator of both books and music on amazon.com. An example of such a person is James McBride, who has authored books, one of which is the fantastic "Color of Water", and is a jazz musician. You can use McBride or any other person as long as s/he has creations in two very different genres of materials. One other restriction is that amazon.com should have at least one or two reviews for the books and music created by your chosen person (this shouldn't be hard to meet since there are reviews for virtually everything on amazon.com).
Create a simple ontology expressed in RDF-s that provides the framework for describing the class/sub-class and property/sub-property relationships in the information from amazon. This does not have to be very complex and only needs to express the following structure:
1. There are two genres of creations: CDs and Books.
2. People can have three roles: author, musician, reviewer
3. There are properties that express the relationships among people in these roles and their creations.
You should include the schema in your submitted zip file with the name amazon.rdfs.
Set up a fedora content repository. Create digital objects for the following entities:
1. The person that is the creator of the books and music.
2. At least one of the books created by this person and at least one of the CDs.
3. At least two of the reviewers of these resources.
4. At least one review from each of these people.
Set up data streams for these objects as follows:
1. For the content (reviews, books, music), fill in the default Dublin Core record with information for the content resource. Don't get carried away with the completeness of the DC record. A minimal amount of information to describe the content (e.g., creator, title, subject, type) is enough.
2. For the people, create an additional data stream that is a simple vcard record as described at Representing vCard Objects in RDF/XML. Again, don't get carried away with the completeness of the vcard record.
Add a disseminator for each content object that disseminates the Dublin Core information as an RSS 1.0 item. The RSS 1.0 documentation on the dc module at http://web.resource.org/rss/1.0/modules/dc/ gives a nice easy example of this item expression format.
Create another data stream in each digital object that is an RDF/XML fragment expressing its relationship to another object in your repository. This RDF fragment should use vocabulary from the simple relationship taxonomy described by your RDF-s. For example, the relationship data stream in the digital object corresponding to a book might express its connection to the digital objects corresponding to the reviews of that book.
Write a small Java program that:
- Extracts the rss item fragment disseminations from each of the content objects and combines them into a single xml document representing an rss channel. You should do this via manipulation of the XML as a DOM tree using JDOM, rather than doing textual manipulations. Write the rss channel xml out to a file called rss.xml.
- Extracts the relationship fragment disseminations and joins them into a single jena model. Write the model out into single RDF/XML file called relationships.xml.
You can then use IsaViz to view the RDF graph produced.
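
The tree-based (rather than textual) combination of item fragments can be sketched as follows. Note the substitution: the assignment asks for JDOM, but this sketch uses the W3C DOM API bundled with the JDK so that it runs without extra libraries. The root element name and the fragments are placeholders, and the structure is simplified relative to a real RSS 1.0 document. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Sketch only: merges XML fragments under one root by manipulating DOM trees.
// Uses the JDK's W3C DOM as a stand-in for JDOM.
class FragmentMerge {
    static String merge(String rootName, List<String> fragments) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();

            // New document with the placeholder root element.
            Document merged = builder.newDocument();
            Element root = merged.createElementNS(null, rootName);
            merged.appendChild(root);

            // Parse each fragment and import its root element into the merged tree.
            for (String fragment : fragments) {
                Document doc = builder.parse(new InputSource(new StringReader(fragment)));
                root.appendChild(merged.importNode(doc.getDocumentElement(), true));
            }

            // Serialize the merged tree with an identity transformation.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(merged), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not merge fragments", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(merge("feed", Arrays.asList(
                "<item><title>A</title></item>", "<item><title>B</title></item>")));
    }
}
```

The same parse, import, and serialize pattern carries over to JDOM, although the class and method names there differ.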

Guidance for assignment 2

You should run your Fedora repository with the built-in McKoi Java-based database. This is the easiest way to get Fedora up and running.

Make sure to take a look at some of the sample objects that come with the fedora distribution. The use of XSLT transforms in the sample objects is a template for the type of objects you will set up in your fedora repository.

As said above, don't spend a huge amount of time creating the metadata for each object. Your grade will not be based on how complete the metadata is. You only need enough to supply the material for the rest of the project.

Submission Procedure

You will use the standard CMS submission procedure for packaging your Java code and associated rdf and xml files by the due date. However, it will be difficult for you to "submit" your Fedora repository to us. Therefore, we will grade you via short 15-20 minute presentations on Tuesday May 18 during which you will have the chance to give an overview of your work. The schedule for presentations is as follows:

Time	Group
9:00	Joseph Egbulefu & Marc Almendarez
9:30
10:00	Ricky M. Yu & Gee-Hsien Chuang
10:30	Stephanie Moy & Ari Tivon Epstein
11:00	Michael Mahar
11:30	Boris Suchkov & Theodore Tang
12:00	Mikolaj Franaszczuk & Gerald Yean
12:30	Dave Vitek & Mike Pape
13:00	Brian Rogan & Karl Schulze
13:30	Mina Radhakrishnan & Patty Reeder
14:00	Benjamin Ee & David Boxer
14:30	Deva Mishra & Chaitanya Desai
15:00
15:30
16:00
16:30
17:00
17:30
18:00	Abhiram Rajendran & Judhajit De
18:30	Jackie Bodine & Vlad Muste
19:00	Arthur Chitikian & Todd Defilippi
19:30	Will Kruse & Matthew Wachs
20:00	Raghav Venkat Agnihothri & Carlos Zednik


Carl Lagoze (lagoze@cs.cornell.edu)
Last changed: 05/18/2004