CS/INFO 431 - Assignments

Project Overview

The projects are designed to give students the chance to build information environments using tools, techniques, and mechanisms covered in the lectures. In general, the assignments require students to do some design work, understand relevant protocol or specifications documents, design and write schema, and write a moderate amount of code.

Students work individually on these assignments. Students must work independently of each other and any evidence of copying of ideas, designs, code, etc. will be considered an academic integrity violation. Obviously, students may use the Web as a reference source, code library, and source of inspiration. However, aspects of assignments that derive from Web resources must be cited. Failure to do this will be considered an academic integrity violation.

The assignments assume that students can program in a high-level language (Java, PHP, Python, Ruby) and understand how to download, install, and use libraries and modules. No programming tutorials will be offered. Assignments will also depend on manipulating XML (e.g., parsing) and XSLT coding, both of which will be introduced in lecture. Students are expected to learn the details of these technologies via on-line materials or available books for this.

Grading Criteria

This is not a programming course. Imaginative algorithms or data structures will not be required or play a role in grading. Instead, grading will be based on completion of the assigned task, demonstrated understanding of the concepts and protocols underlying the assignment, and project design. Nevertheless, assignments should demonstrate good programming practices and documentation commensurate with the 400 level of this course.

Tools

Programming assignments should be done using the Eclipse IDE. This is available for free from http://www.eclipse.org for all major operating systems. Submissions will be in the form of Eclipse projects.

Working with XML, XSLT, and the like is considerably easier if you don't have to worry about syntactic details. I highly recommend that you use oxygen, a very nice environment for working in the XML/XSLT world. It is available for a 30-day free trial and has a very attractive academic license fee. It also integrates as a plug-in for eclipse. Oxygen runs on Windows, Mac OS X, and linux!

Submission

All assignments are due as listed on the course calendar. NO LATE ASSIGNMENTS WILL BE ACCEPTED.

Projects will be submitted as zip files to CMS. They MUST conform to the following guidelines:

A student's name will serve as the "firstlast" in the remainder of these instructions.
The Eclipse project must be named as firstlastassignment# (e.g., CarlLagozeAssignment1)
The Eclipse project directory should be submitted as one ZIP file.
The project should be self-contained: it should be possible to import it into Eclipse on another computer (e.g., the graders) and run it without having to resolve any library dependencies.

Submissions that fail to conform to these guidelines will be rejected.

Project 1

Due Date:

April 7, 2008 11:59PM

In this assignment you will compose a mashup of information on a science subject. Two of the information sources, Google news and Yahoo news, return news of the day on the respective subject and one, the National Science Digital Library, returns background reference sources on the subject. To do this you will use the following skills covered during the semester thus far:

Programmatic URL access
Basic XML
XML schema design and construction
XSLT transformation
Syndication and harvesting

The following instructions provide details for the project. Please read them and follow them carefully:

Your mashup must combine search results on a common topic from the three following sources (in all three cases you may limit your result set to 10 items):
1. Google news: specified to return an ATOM feed. An example URL for this type of search is http://news.google.com/news?q=%22Global%20Warming%22&output=atom.
2. Yahoo news: specified to return an RSS 2.0 feed. An example URL for this type of search is http://news.search.yahoo.com/news/rss?p=global+warming.
3. National Science Digital Library (NSDL): returned in the NSDL REST XML format. An example URL for this type of search is http://ndrsearch.nsdl.org/search?n=10&q='Global%20Warming'&s=0.
The format for your mashup will be defined by a schema which you design and write. The rules for your schema design are as follows:
1. It must built on this base schema. You must extend the base schema definitions without copying that schema file and making any changes to it.
2. It must use entity definitions and respective name spaces from the following schema:
3. It must be in one file separate from the base schema.
4. The entity definitions in your new schema must be in the same namespace as the target namespace of the base schema.
5. It must validate this template XML document that you should use as a sample of the type of document your mashup will produce.
Your finished project must exhibit the following functionality:
1. It should be invokable as follows:
  1. REQUIRED: Standalone, where the query terms are specified as a parameter to the program at startup.
  2. OPTIONAL: As a server based script where the query terms are entered via a web form. (You will use this when you demonstrate your project to us. Please let us know if you can't run a web server on your laptop and we will set up an account for you on a course server)
2. It must submit the query to the three data sources via HTTP requests.
3. It must combine and convert via XSLT the results of the three queries into one XML mashup document that is associated with and valid according to the schema you constructed in as defined above.
4. It must produce an xHTML document (browser viewable and valid according to the strict xHTML definition) that is a human readable representation of your mashup document.
  1. This xHTML must be produced via an XSLT transform that is either integrated with the transform described above (note that this will require you to produce two output documents from this transform) or via a separate transform of the XML mashup document.
  2. This xHTML document must link to your XML source document using the <link> convention used for RSS and ATOM feeds. You should use the Internet Media Type application/mashup+xml.

Here are a few helpful hints (this list may expand throughout the assignment):

Make sure you understand the use of the include and import constructs as you write your schema. These are important tools for controlling the namespaces of entities that you define.
Make sure you understand the use of the substitutionGroup and abstract constructs as you write your schema. These are important tools for schema type extension.

Your project must be submitted in the manner defined above. It must include:

The code (in PHP, Perl, Java, or Python) that accepts the queries, invokes the HTTP requests, invokes the XSLT transform(s), and returns the xHTML.
The XSLT scripts(s).
The schema document.

You will also demonstrate your project in a 15 minute slot on April 8. You must schedule this time via CMS.

The grading criteria for the project are as follows:

Objective criteria for grading:
1. conformance of schema to specifications
2. well-formedness of XML documents
3. validity of instance documents to schema
4. construction of XSLT transform(s)
5. ability to run your code and produce output as specified
Subjective criteria for grading:
1. Demonstration of logical design decisions in schema
2. Professionalism in project assembly and packaging

Project 2

Due: May 12, 2008 11:59PM (NO LATE SUBMISSIONS ACCEPTED)

In this assignment you will use semantic web technologies to manipulate data from the Amazon Associates Web Service (AWS). The API to this service provides full access to structured XML data from Amazon for operations such as searching, item information, purchasing, etc. The components of this project are as follows with additional details provided below:

You will develop an OWL-RDF ontology for the basic entities and relationships within amazon.com. This ontology does not have to model the entire Amazon information space! But, your ontology should at a minimum include the following notions:
1. Agents and their sub-types: people and organizations who create and do things. Some examples of agents are authors, musicians, publishers, reviewers
2. Products and their sub-types: the stuff that Amazon sells. For this prototype you can limit your sub-types to products that are intellectual content such as books, DVDs, music, and the like
3. Lists and their sub-types: the various aggregations shown on amazon.com. This includes lists of similar products, ListMania lists, etc.
Use Jena to load up an in-memory RDF model that conforms to your schema using item data returned from search requests to AWS.
Run inferencing on your instance data based on the assertions in your ontology.
Implement an interface to your model through which SPARQL queries can be entered to demonstrate the effects of inferencing.

Detailed Instructions

Register for use of the Amazon Web Services (AWS). Registration is available via a link from http://www.amazon.com/gp/browse.html?node=3435361. Registration is free, but you must have an access code to use AWS.
Download the documentation for the AWS (http://docs.amazonwebservices.com/AWSECommerceService/2008-03-03/DG/).
Experiment with the ItemSearch, ItemLookup, and ListLookup operations to understand their structure, options, and XML output. I highly recommend you use the REST interface (simple URI requests) rather than the too-complex SOAP interface. You will find that Oxygen is a big help for examining the structure of the XML responses to these calls.
Download the Protege Ontology Editor (http://protege.stanford.edu/). You should install version 3.3.1 or 3.4 beta, with all plug-ins.
Download the Pellet DL reasoner (http://pellet.owldl.com/).
Complete information about Protege and OWL is available in the Protege Owl Tutorial (http://www.co-ode.org/resources/tutorials/protege-owl-tutorial.php). You shouldn't have to run through this entire tutorial, since the course lectures should provide you with sufficient background.
Design your Amazon ontology in Protege. As stated above you do not need to model the entire Amazon information space. Your ontology should be designed with a usage scenario like the following in mind.
1. Load your model with data from the following queries to Amazon:
  1. Search for CDs where Madonna is an artist.
  2. Search for DVDs where Madonna is a actor.
  3. Search for Books where Madonna is an author.
  4. Search for DVDs where Antonio Banderas is an actor.
2. Run a reasoner over your model using your ontology as the basis for inferencing.
3. Perform the following SPARQL queries on your model.
  1. Return all the movies in which Madonna acted.
  2. Return all the products in which Madonna is an agent.
  3. Return all the actors who appeared with each other.
Some guidelines for your ontology design are as follows:
1. You should include main concepts like Agents, Products, and Lists, with appropriate sub-classes. For example, your ontology should represent the notion that authors of books and artist of CDs are subsumed within the broader Agent covering class.
2. You should include properties (relationships) among these concepts in a hierarchy. For example, your ontology should represent the notion that there is a relationship between an author and a book, and a relationship between an artist and CD, both of which are subsumed by a covering relationship.
3. Your ontology should include at least one transitive relationship. A good choice is the notion of "appeared with" in a movie.
4. Your ontology should include domain and range constraints on properties.
5. Your ontology does not need to include restrictions. In fact, if you include these you will probably explode the triples in the inferencing process beyond the capacity of your machine.
6. Your ontology does not need to include inverse properties.
Download the java libraries for the Jena Semantic Web Framework for Java - http://jena.sourceforge.net/.
Write a java program that does the following:
1. Accepts a query input consisting of an Agent string (e.g., Madonna) and a genre (either DVD, CD, or Book) translates these to an AWS ItemSearch URI. For example, the query terms "Madonna" and "DVD" should translate to an AWS search for all DVDs in which Madonna is an actor.
2. Issues the ItemSearch URI to Amazon. You will probably want to set ResponseGroup to "Large" to get as much information back from the search as possible.
3. Translates the XML return from the ItemSearch to triples that conform to your ontology and load those triples into a Jena model. You may do this in either of the following methods:
  1. Run an XSLT transform over the XML from AWS and translate it to a serialization of the triples that you want to load into your model. This serialization may be RDF-XML, N3, or N-TRIPLES. You will probably find the last two easier to generate than RDF-XML. You can then use Jena's I/O operations to read the serialization, parse it, and generate the model triples.
  2. Parse the AWS XML response (DOM or SAX) and generate appropriate model building calls in Jena.
4. Loops to accept another query (the results of which will be added to the model), or allows the user to specify that query input is complete.
5. Runs the reasoner over the model generated from the Amazon data, using your ontology as the schema for inferencing. This should generate a second inference model.
6. Loops to allow SPARQL queries over either the asserted or inference model and formats the results in some reasonably readable format.

NOTE: There are numerous opportunities for extra credit work on this project. For example, you may:

Implement a persistent store interface (e.g., mysql) and load a substantial amount of data in that store for subsequent queries.
Expand you ontology with restrictions and resulting multiple inheritance. Be careful of expanding your model beyond the memory capacity of your machine.
Link to Graphviz to allow graphical viewing of your model.
Include data from some other web source - e.g., artists pages on myspace.

I am open to extra credit but please see me before hand.

You should submit, via cms, a single zip file, which should include:

The OWL/XML file for your ontology.
The Java code.
XSLT code if you use that to transform AWS results.
A README file describing query inputs for a sample run and matching SPARQL queries that demonstrate the manner in which reasoning using your schema affects query results.

You will be graded on:

Ontology Design: completeness and validity.
Java Program: correspondence of ontology to loaded model, completeness, ability to execute as specified
Professionalism