The projects are designed to give students the chance to build information
environments using tools, techniques, and mechanisms covered in the lectures.
In general, the assignments will require students to do some design work, understand
relevant protocol or specification documents, design and write schemas, and write a moderate amount of
Java code and XSLT code.
Students will work in groups of two on these assignments. There
will be an opportunity for group formation at the beginning of the semester
and groups will remain together for the remainder of the semester. Members
of the group are expected to share information and design ideas, jointly understand
protocol documents and APIs, and write the final code product. Groups
must work independently of each other and any evidence of copying of ideas,
designs, code, etc. will be considered an
academic integrity violation.
Grades will be awarded based on the final product of the group and each student's
contribution to the work of the group.
The assignments assume that students can program in Java and understand how
to download and use class libraries. No Java or programming tutorials
will be offered. Assignments will also depend on XSLT coding, which will
be introduced in lecture. However, there will not be detailed XSLT programming
tutorials. Students are expected to use on-line materials or available
books for this.
This is not a programming course. Imaginative algorithms or data structures
will not be required or play a role in grading. Instead, grading will
be based on completion of the assigned task, demonstrated understanding of the
concepts and protocols underlying the assignment, and project design.
Nevertheless, assignments should demonstrate good programming practices and
documentation commensurate with the 400 level of this course.
Programming assignments should be done using the Eclipse IDE. This
is available for free from http://www.eclipse.org
for all major operating systems. Submissions will be in the form of Eclipse
projects.
Working with XML, XSLT, and the like is considerably easier if you don't
have to worry about syntactic details. I highly recommend that you use
oXygen, a very nice
environment for working in the XML/XSLT world. It is available for a 30-day free
trial and has a very attractive academic license fee. It also integrates
as a plug-in for Eclipse. oXygen runs on Windows, Mac OS X, and Linux.
All assignments are due as listed on the course calendar. NO LATE ASSIGNMENTS
WILL BE ACCEPTED.
Projects will be submitted as zip files to CMS. They MUST conform
to the following guidelines:
- Each group should identify a group leader when they form. That
person's name will serve as the "firstlast" in the remainder of these instructions.
- The Eclipse project must be named as firstlastassignment# (e.g., CarlLagozeAssignment1)
- The first executable line of the program should be System.out.println("TeamMember1, TeamMember2");
- The project directory should be submitted as one ZIP file.
Submissions that fail to conform to these guidelines will be rejected.
Due Dates
The project is due in two stages:
- The schema construction part of the assignment, which is described
in detail in step 4 below, is an INDIVIDUAL PROJECT, and will be the
mid-term for the course. Work on this aspect MUST be the sole
effort of each student. Due date for the schema is March 16, 2007
at 11:59PM.
- The remainder of the project (XSLT, Java, etc.) should be done in
two-person project groups. The group can choose to use a
schema from one of the group members, or the schema supplied by the
instructor. Due date for the group project is April 9, 2007 at
11:59PM.
Overview
The Dublin Core Abstract Model [1] provides a means for packaging
together a group of related resources. In lecture we discussed how
this can be used to express resource relationships such as FRBR [2].
The OAI Protocol for Metadata Harvesting (OAI-PMH) [3] is a mechanism for
harvesting metadata in multiple formats, encoded in XML, from data
providers. In this project you will specialize a schema that defines
the DCMI abstract model for ePrints [4] for the FRBR model. You will
then harvest Dublin Core metadata from CiteSeer [5] and transform that
metadata into XML documents that are valid according to your schema.
IMPORTANT: INDIVIDUAL VS. GROUP PARTS OF PROJECT
Detailed Instructions
- Review the OAI-PMH specification [3], in particular the description of
the ListRecords request. Formulate the request URL to harvest metadata
from CiteSeer at baseURL
http://cs1.ist.psu.edu/cgi-bin/oai.cgi, narrowing the request to the
metadata format oai_citeseer with records available since
January 1, 2005. You will probably want to experiment with this in
Oxygen to understand the structure of the request and response before
moving on to later steps.
- Examine the schema that specifies the application profile for DC-XML
Eprints documents [4] in Oxygen to understand its structure and semantics.
You will be using this schema in the assignment. You will find it helpful to examine an example of a conformant instance
document [6] in Oxygen.
- Examine the schema that defines the basic 15 Dublin Core elements
[7] to
understand its structure and semantics. You will be using this
schema in the assignment.
- NOTE: THIS PORTION OF THE PROJECT SHOULD BE DONE INDIVIDUALLY BY
EACH STUDENT. PROJECT COLLABORATION IS NOT PERMITTED. Construct a schema that specializes the DC Eprints schema [4] to more
strictly follow FRBR and explicitly include appropriate Dublin Core
elements. The example XML instance file at [8] will help you
understand the goal of your schema - pay particular attention to the
comments in the file. Note the following aspects of this
file compared to the example at [6]:
- Like the file at [6], it has a descriptionSet outer element that
consists of a set of descriptions, each of which has a set of
statements.
- Unlike the file at [6] the descriptions within a
descriptionSet
are elements specialized to the FRBR entities: work, expression,
manifestation, and item.
- Each description still has statements, but unlike the file at [6],
DC elements are included explicitly rather than appearing as
propertyURI
attributes of statements.
- Like the file at [6] higher level FRBR entities, such as works,
are linked to their lower level entity, such as expression.
- You should observe the following guidelines as you produce a
schema that validates an instance document like the template at [8]:
- Your schema should use the appropriate import,
redefine, and
include constructs to reuse, include, or redefine aspects of the
standard DC Eprints schema [4] and the DC elements schema [7].
In other words, you should not repeat any constructs that you
will use unchanged from the existing schema.
- HINT: It is going to take some thinking about how to best
modify the descriptionSet element from [4]. Study
substitution groups as a means of defining alternate content
models or types for an existing declaration. Take a look
at the example presented in class [9], which allows specific
products (such as shirt, umbrella, etc.) to substitute for a
generic product.
- Your schema should make use of namespaces: specifically dc
elements should remain in their respective namespace and the
structural elements of the DC Eprints schema, including your
refinements, should remain in the original
http://purl.org/eprint/epdcx/2006-11-15/ namespace.
- Look carefully at the comments in [8] for constraints about
the respective FRBR description containers. Specifically
note the cardinality constraints and constraints on DC elements
for each FRBR entity.
- NOTE: your schema must define which DC elements are
allowable for each FRBR entity. For example, an element like
dc:rights is appropriate for only ONE FRBR entity.
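The substitution-group hint above can be tried out in isolation before you touch the Eprints schema. The following is only a sketch under stated assumptions: the tiny schema embedded in the code is made up for illustration (its element names echo the product/shirt/umbrella example, but it is not the file at [9], and it has nothing to do with the FRBR schema you must write). It uses the JDK's built-in javax.xml.validation API to show that a member of a substitution group is accepted wherever the head element is referenced, while the abstract head itself is rejected.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class SubstitutionGroupSketch {

    // Hypothetical schema: "product" is an abstract head element; "shirt" and
    // "umbrella" may appear wherever "product" is referenced.
    static final String SCHEMA =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='product' type='xs:string' abstract='true'/>"
      + "<xs:element name='shirt' type='xs:string' substitutionGroup='product'/>"
      + "<xs:element name='umbrella' type='xs:string' substitutionGroup='product'/>"
      + "<xs:element name='catalog'>"
      + "<xs:complexType><xs:sequence>"
      + "<xs:element ref='product' maxOccurs='unbounded'/>"
      + "</xs:sequence></xs:complexType>"
      + "</xs:element>"
      + "</xs:schema>";

    // Returns true if the instance document validates against SCHEMA.
    static boolean isValid(String instanceXml) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(SCHEMA)));
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(instanceXml)));
            return true;
        } catch (SAXException e) {
            // With no custom ErrorHandler, validation errors surface as exceptions.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Members of the substitution group stand in for the head element.
        System.out.println(isValid("<catalog><shirt>Oxford</shirt><umbrella>Golf</umbrella></catalog>"));
        // The abstract head element itself may not appear in an instance.
        System.out.println(isValid("<catalog><product>Generic</product></catalog>"));
    }
}
```

The same pattern (an element declared with a substitutionGroup attribute pointing at an existing head element) is one possible way to let specialized description elements appear where the original schema expects a generic one; whether it fits your design is for you to decide.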
- Write a Java program that does the following:
- Issues the OAI-PMH request via HTTP to CiteSeer (see HTTPClient
[10]).
- Extracts each metadata record from the response and writes it to
a separate file. An example of a record is at [13]. You
should use dom4j [12] or JDOM [11] to parse the OAI-PMH response
and extract the first 10 records. You may choose either SAX or DOM
as your parsing model. (However, it is not permissible to
treat your XML as a simple text file and do your work via simple
text manipulation.) You should name these files sequentially:
record1.xml, record2.xml, ..., record10.xml.
These files will be part of your submission.
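The request and extraction steps above can be sketched roughly as follows. To stay self-contained, this sketch makes several assumptions that differ from the assignment: it uses the JDK's built-in DOM parser rather than dom4j or JDOM (which your submission must use), it parses a heavily trimmed, made-up response string instead of fetching one over HTTP, and the class and method names are hypothetical. The ListRecords arguments (verb, metadataPrefix, from) and the OAI-PMH namespace come from the specification [3].

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class HarvestSketch {

    // Namespace of the OAI-PMH 2.0 response envelope.
    static final String OAI_NS = "http://www.openarchives.org/OAI/2.0/";

    // Builds a selective-harvest ListRecords request URL.
    static String listRecordsUrl(String baseUrl, String metadataPrefix, String from) {
        return baseUrl + "?verb=ListRecords"
             + "&metadataPrefix=" + metadataPrefix
             + "&from=" + from;
    }

    // File name for the i-th extracted record (1-based).
    static String recordFileName(int i) {
        return "record" + i + ".xml";
    }

    // Parses a response and returns up to 'limit' record elements as XML strings.
    static List<String> firstRecords(String responseXml, int limit) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
            .parse(new ByteArrayInputStream(responseXml.getBytes(StandardCharsets.UTF_8)));
        NodeList records = doc.getElementsByTagNameNS(OAI_NS, "record");

        // An identity transformer serializes each record subtree back to text.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        List<String> out = new ArrayList<>();
        for (int i = 0; i < records.getLength() && i < limit; i++) {
            StringWriter writer = new StringWriter();
            serializer.transform(new DOMSource(records.item(i)), new StreamResult(writer));
            out.add(writer.toString());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(listRecordsUrl(
            "http://cs1.ist.psu.edu/cgi-bin/oai.cgi", "oai_citeseer", "2005-01-01"));

        // Trimmed, made-up response; a real one carries more envelope elements.
        String sample = "<OAI-PMH xmlns='http://www.openarchives.org/OAI/2.0/'><ListRecords>"
            + "<record><header><identifier>oai:example:1</identifier></header></record>"
            + "<record><header><identifier>oai:example:2</identifier></header></record>"
            + "</ListRecords></OAI-PMH>";
        List<String> records = firstRecords(sample, 10);
        for (int i = 0; i < records.size(); i++) {
            System.out.println(recordFileName(i + 1) + ": " + records.get(i));
        }
    }
}
```

In your own program the response would come from the HTTP client [10], and each serialized record would be written to its file rather than printed.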
- Write an XSLT document that transforms a record XML document into one
that is valid according to the correct FRBR schema at [15]. The
elements of the original record that should be inserted into the new
schema-conformant record are as follows:
- The value of the identifier element from the record header
should be used as the value of the work description
resourceURI attribute in the output xml.
- All dc elements from the original record should be inserted in
their appropriate FRBR entity in the output xml.
- All other descriptive elements (those in the
oai_citeseer namespace) should be placed
in the appropriate FRBR entity using the statement structure.
Note that some of the oai_citeseer elements, such
as author, are complex types with an attribute value and children.
In this case you will concatenate the string values together as
shown in the example in [13].
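Two of the mapping rules above (copying the header identifier into a resourceURI attribute, and concatenating an attribute value with child text) can be illustrated with a small stylesheet run through the JDK's built-in XSLT 1.0 processor. Everything specific here is an assumption made for illustration: the input record is made up, the urn:example:citeseer namespace and the rec, author, and affiliation names are placeholders rather than the real oai_citeseer format (see [13] for that), and the output shape is not guaranteed to match the template at [8].

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class RecordToFrbrSketch {

    // Hypothetical stylesheet: element names and the cs namespace are placeholders.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:oai='http://www.openarchives.org/OAI/2.0/'"
      + " xmlns:cs='urn:example:citeseer'"
      + " xmlns:epdcx='http://purl.org/eprint/epdcx/2006-11-15/'"
      + " exclude-result-prefixes='oai cs'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/oai:record'>"
      // The header identifier becomes the resourceURI of the work description.
      + "<epdcx:work resourceURI='{oai:header/oai:identifier}'>"
      + "<xsl:for-each select='oai:metadata//cs:author'>"
      + "<epdcx:statement>"
      // Concatenate the attribute value with the element's string value.
      + "<xsl:value-of select=\"normalize-space(concat(@name, ' ', .))\"/>"
      + "</epdcx:statement>"
      + "</xsl:for-each>"
      + "</epdcx:work>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies STYLESHEET to one record and returns the result as a string.
    static String transform(String recordXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(recordXml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up record for illustration only.
        String record =
            "<record xmlns='http://www.openarchives.org/OAI/2.0/'>"
          + "<header><identifier>oai:example:1</identifier></header>"
          + "<metadata><cs:rec xmlns:cs='urn:example:citeseer'>"
          + "<cs:author name='A. Author'><cs:affiliation>Example U</cs:affiliation></cs:author>"
          + "</cs:rec></metadata>"
          + "</record>";
        System.out.println(transform(record));
    }
}
```

In the assignment the stylesheet lives in its own file; embedding it as a string here only keeps the sketch runnable on its own.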
- Modify your Java program so that it does the following:
- Iterates through each of the files produced earlier and reads each
in.
- Processes each file using your XSLT document (see dom4j [12] or
Saxon [14]). The result should be valid according to your schema
(you will probably want to test this manually). Writes each output
file out as record1-transformed.xml ... record10-transformed.xml.
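The batch step above might look something like the sketch below. It assumes the JDK's built-in XSLT processor instead of dom4j or Saxon, compiles the stylesheet once and reuses it for every file, and uses an identity stylesheet as a stand-in for your real one; the class and method names are hypothetical.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class BatchTransformSketch {

    // Stand-in stylesheet that copies its input unchanged.
    static final String IDENTITY =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Output file name for the i-th record (1-based).
    static String outputName(int i) {
        return "record" + i + "-transformed.xml";
    }

    // Transforms record1.xml .. record<count>.xml found in dir; returns how many were written.
    static int transformAll(Path dir, String stylesheet, int count) throws Exception {
        // Compile the stylesheet once; each file gets a fresh Transformer from it.
        Templates compiled = TransformerFactory.newInstance()
            .newTemplates(new StreamSource(new StringReader(stylesheet)));
        int written = 0;
        for (int i = 1; i <= count; i++) {
            Path input = dir.resolve("record" + i + ".xml");
            if (!Files.exists(input)) {
                continue; // skip missing inputs in this sketch
            }
            try (InputStream in = Files.newInputStream(input);
                 OutputStream out = Files.newOutputStream(dir.resolve(outputName(i)))) {
                compiled.newTransformer().transform(new StreamSource(in), new StreamResult(out));
            }
            written++;
        }
        return written;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("records");
        Files.write(dir.resolve("record1.xml"),
            "<record><title>Example</title></record>".getBytes(StandardCharsets.UTF_8));
        System.out.println("files written: " + transformAll(dir, IDENTITY, 10));
    }
}
```

Validation of each output against your schema could be added inside the loop with javax.xml.validation, or done by hand in oXygen as suggested above.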
What you should turn in
For the individual part of the project due on March 16 each student
should submit, via CMS, a single xsd file called frbr.xsd. It should
reference all other files via URLs.
For the group part of the project due on April 9, each group should submit, via CMS, a single zip file of
your Eclipse project. You can create this by exporting your Eclipse
project and then creating a zip file of the result. This project directory
MUST include:
- Java source file
- XML schema for the metadata format that you use to validate your XML
documents.
- XSLT document to transform from OAI-PMH records to your metadata format.
- The record1.xml ... record10.xml intermediate files
- The record1-transformed.xml ... record10-transformed.xml final files
- The library jars necessary to run your program
Your project MUST be configured so that we can
easily run your program.
This means that it must include the jar files for any Java libraries that you
make use of and your build path MUST reference
those jars internal to the project. Do NOT
configure the library build path to reference locations in your own file
system, which would leave us to figure out how to set the build path. Failure to configure your project so
that it is easy for us to run and test will result in significant point loss.
What you will be graded on
Objective criteria for grading:
- well-formedness of XML documents
- validity of instance documents to schema
- conformance of schema to specifications
- ability to run your code and produce output as specified
Subjective criteria for grading:
- Demonstration of logical design decisions in schema
- Demonstration of FRBR and DC principles
- Professionalism in project assembly and packaging
References
[1] DC Abstract Model - http://dublincore.org/documents/abstract-model/
[2] FRBR -
http://www.ifla.org/VII/s13/frbr/frbr.pdf
[3] OAI-PMH version 2.0 -
http://www.openarchives.org/OAI/openarchivesprotocol.html.
[4]
DCAM Eprints model -
http://www.cs.cornell.edu/courses/cs431/2007sp/assignments/Project1/Resources/epdcx.xsd
[5] CiteSeer - http://citeseer.ist.psu.edu/.
[6] DCAM Eprints model instance - http://www.cs.cornell.edu/courses/cs431/2007sp/assignments/Project1/Resources/ex1.xml
[7] DC elements schema -
http://dublincore.org/schemas/xmls/qdc/2006/01/06/dc.xsd.
[8]
Output template -
http://www.cs.cornell.edu/courses/cs431/2007sp/assignments/Project1/Resources/template.xml
[9]
Substitution group example -
http://www.cs.cornell.edu/courses/CS431/2007sp/examples/xml_schema/products.xsd
[10]
Apache Jakarta HTTP Client -
http://jakarta.apache.org/commons/httpclient/
[11] JDOM - http://www.jdom.org/
[12] dom4j - http://dom4j.org/
[13] record example -
http://www.cs.cornell.edu/courses/cs431/2007sp/assignments/Project1/Resources/record.xml
[14]
Saxon XSLT and XQuery Processor -
http://saxon.sourceforge.net/
[15]
FRBR schema supplied by the instructor -
http://www.cs.cornell.edu/courses/cs431/2007sp/assignments/Project1/Resources/frbr.xsd
Due Date
The project is due May 14 11:59PM.
Overview
Since Amazon's launch in 1995, the complexity of its business and the
information it manages has grown immensely. From selling just books,
Amazon now is not only a shopping center for a diverse set of products, but is
also a collaborative space that relates products, consumers, and producers in
many ways. Amazon's rapid growth has led to a situation where the entities
and their relationships that are shown on the amazon.com pages are managed in
an ad hoc fashion. Your managers are intrigued by a briefing you gave them
on semantic web technology. They would like you to prototype additional work in this
area to demonstrate the possibility of a complete redesign of the Amazon
back-end based on these technologies. Specifically, they would like you to:
- Develop an initial ontology in OWL that provides a meta-model for the
entities and relationships within amazon.com. Remember, this is
prototype work and you don't need to model the entire information space!
But, your ontology should at a minimum include the following notions:
- Agents and their sub-types: people and organizations who create and
do things. Some examples of agents are authors, musicians,
publishers, reviewers
- Products and their sub-types: the stuff that Amazon sells. For
this prototype you can limit your sub-types to products that are
intellectual content such as books, DVDs, music, and the like
- Lists and their sub-types: the various aggregations shown on
amazon.com. This includes lists of similar products, ListMania
lists, etc.
- Model an instance of your ontology using Fedora and Amazon ECS.
Your resulting Fedora implementation should have the following
characteristics:
- It should contain digital objects for:
- Classes, and sub-classes, defined in your ontology.
- Instances of classes (books, reviewers, etc.) defined in your
ontology. These instance digital objects should correspond to
items in amazon.com accessible via ECS "lookup" operations. Your
repository should include a subset of the entities on two amazon.com web
pages for products in two genres by one creator. An example is the
author and musician James McBride, who has a wonderful book, "The
Color of Water", and a nice jazz CD, "The Process Volume 1". You do NOT need to instantiate every list item, reviewer,
etc. on the pages you choose - just create enough digital objects to
demonstrate your ontology concepts and their relationships.
- It should define relationships among the digital objects including:
- The sub-class relationships between classes, using the
rdfs:subClassOf property.
- The relationships among the various amazon.com items, using the
properties in your ontology.
- The assignment of your amazon.com entities to their appropriate classes, using
the rdf:type property.
- The digital objects for the two products should produce xHTML
disseminations that are based on output from the corresponding ECS request and queries to the Fedora
relationship index that return relationships among digital objects.
For example, using the James McBride example again, your digital object
corresponding to "The Color of Water" could disseminate a web page
giving a cover page for the book displaying some metadata and then links
to reviews, lists, etc. In effect, your
Fedora repository should produce a new amazon.com web site using the
disseminations from the Fedora repository and the ECS calls as a basis.
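In Fedora, relationships such as the rdfs:subClassOf and rdf:type links listed above are recorded as small RDF/XML fragments in each object's RELS-EXT datastream [8]. The sketch below builds such a fragment with the JDK's DOM API. The PIDs (demo:Book, demo:Product, demo:ColorOfWater) and the class and method names are hypothetical, and the fragment shape (an rdf:Description whose rdf:about is the info:fedora/ URI of the object) follows my understanding of the RELS-EXT conventions; check it against [8] before relying on it.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class RelsExtSketch {

    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#";

    // Builds an RDF/XML fragment asserting: <subject> <property> <object>.
    static String relsExt(String subjectPid, String propertyNs,
                          String propertyQName, String objectPid) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().newDocument();

        Element rdf = doc.createElementNS(RDF_NS, "rdf:RDF");
        doc.appendChild(rdf);

        // The description is about the digital object that owns the datastream.
        Element description = doc.createElementNS(RDF_NS, "rdf:Description");
        description.setAttributeNS(RDF_NS, "rdf:about", "info:fedora/" + subjectPid);
        rdf.appendChild(description);

        // One child element per relationship, pointing at the related object.
        Element relation = doc.createElementNS(propertyNs, propertyQName);
        relation.setAttributeNS(RDF_NS, "rdf:resource", "info:fedora/" + objectPid);
        description.appendChild(relation);

        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // A class object declaring its superclass (hypothetical PIDs).
        System.out.println(relsExt("demo:Book", RDFS_NS, "rdfs:subClassOf", "demo:Product"));
        // An instance object declaring its class (hypothetical PIDs).
        System.out.println(relsExt("demo:ColorOfWater", RDF_NS, "rdf:type", "demo:Book"));
    }
}
```

Relationships drawn from your own ontology would use your ontology's namespace and property names in place of RDFS_NS and rdfs:subClassOf.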
Detailed Instructions
- Register for use of the Amazon E-Commerce Service (ECS). Registration is available via a
link from
https://aws-portal.amazon.com/gp/aws/developer/registration/index.html.
Registration is free, but you must have an access code to use ECS.
- Review the documentation for the Amazon E-Commerce Service (ECS)
[5]. Experiment with various operations, especially the ItemLookup, CustomerContentLookup, and ListLookup
operations to understand their output. You will find that oXygen is a big help for examining the structure of the XML responses to these
calls.
- Install Fedora 2.2 [1][2]. The software will install quite easily on
Windows XP, Mac OS X, or various flavors of Linux. I have posted specially
annotated installation instructions to make this as easy as possible.
MAKE SURE YOU LOOK AT AND FOLLOW ALL MY RED MARKUP!
- Start up Fedora and run through the Fedora tutorial [3]. After
finishing this tutorial, you should be comfortable with basic Fedora
concepts needed for the assignment.
- Download the Fedora Image Collection Demo [4] to help you understand how
to encode relationships in Fedora and embed queries to the relationship
index in datastreams, and use them in disseminators. You might want to experiment with the Fedora
Resource Index Search Service [7][8] to further understand how this is done.
- Download the Protege Ontology Editor
(you should install version 3.2 or 3.3 beta, with all plug-ins).
- Complete information about Protege and OWL is available in the Protege
Owl Tutorial [11]. You shouldn't need to run through this entire
tutorial, since the course lecture should provide you with sufficient
background.
- Design your Amazon ontology as a Protege "OWL Files" Project. As
stated above you do not need to model the entire Amazon information space,
but do need to include concepts like Agents, Products, and Lists.
- Download the Racer reasoner [10] to test that your resulting ontology is
consistent. (Note that Racer runs on port 8080 and will not run concurrently
with Fedora). Racer is available free for 30 days. There is also
an academic license if you wish to hold on to it longer.
- Pick two Amazon web pages, related to each other by the same creator;
e.g., the James McBride example mentioned above. These pages will
provide the basis of your Fedora implementation.
- Download and import the following two digital objects, which you will
use in your project:
-
http://www.cs.cornell.edu/courses/cs431/2007sp/assignments/Project2/resources/Proj2Bdef.xml
- This is a BDef with one operation, Query.
-
http://www.cs.cornell.edu/courses/cs431/2007sp/assignments/Project2/resources/Proj2Bmech.xml
- This is a BMech, refining the above BDef. The specification of the
BMech is as follows:
- It takes one datastream parameter, with MIME type text/plain, that is an
ITQL query to the resource index.
- It returns a SPARQL XML document that is the response to the query.
- Create the digital objects that demonstrate your model using the Fedora
administrative interface. Note that you
could also use an XML editor to create the raw FOXML objects and ingest
them, but this is probably harder than creating them through the UI. Do
NOT spend the time creating objects for all the entities on your selected
Amazon web pages. You should only create enough objects to demonstrate
the concepts in your ontology. So for example, you only need to select
a couple of reviews and list items from your pages. Your digital
object design should incorporate the following features:
- Digital objects corresponding to classes and sub-classes in your
ontology should include:
- Dublin Core descriptions where the title is the name of the class, type
is "owl:class", and identifier is the URI of the class. You MUST fill
in this Dublin Core metadata in this manner to help with our grading.
- RELS-EXT RDF fragments expressing class/subclass relationships (using
rdfs:subClassOf).
- Digital objects corresponding to Amazon items should include:
- Dublin Core descriptions where the title is the name of the item, the
creator is the name of your project group leader, the type is the URI of the
respective class in your ontology, and in the case of products the Dublin
Core identifier is the URL of the Amazon
page corresponding to the item. You MUST fill in this Dublin Core
metadata in this manner to help with our grading.
- A datastream that is a redirect to the amazon.com ECS "lookup" call
for the item, with ResponseGroup set to medium for products and small for
other Amazon entities.
- RELS-EXT rdf fragments connecting the item to its class digital object
(using rdf:type) and connecting the item to related items using properties
in your ontology (e.g. connecting reviews to an item).
- Digital objects corresponding to Amazon items that are products (e.g.,
the book, DVD, CD, etc.) should have the following characteristics:
- They should have a datastream that is the query input for the Proj2Bmech
that you imported in an earlier step. This query should return the PIDs of
the various items (author, reviews, etc.) related to this product.
- They should have a disseminator that employs the Proj2Bmech and consumes
the query input datastream defined above.
- They should have an XSL datastream that prettyprints in xHTML the Amazon
ECS response (don't go overboard with this). The xHTML should also contain a
link to the dissemination produced by Proj2Bmech, providing the linkages of
this product to reviews, etc. (Clearly this link will produce only the
SPARQL xml result, but you could imagine using this as the input for
additional xsl that prettyprinted this).
- They should have a disseminator that employs the built-in Fedora saxon
service to consume this XSL datastream and the Amazon ECS response
datastream to produce the xHTML (see section 7.1 of the Fedora tutorial
[3]).
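The remark above about prettyprinting the SPARQL XML result with additional XSL can be made concrete with a small sketch. It assumes the result follows the W3C SPARQL Query Results XML Format (namespace http://www.w3.org/2005/sparql-results#, with binding and uri elements); the output of your Fedora installation may use a different layout, in which case the match patterns would need adjusting. The sample result and PIDs are made up, and the stylesheet is run with the JDK's built-in XSLT processor rather than inside Fedora.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class SparqlToXhtmlSketch {

    // Turns every URI binding in the result into an xHTML list item with a link.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:sr='http://www.w3.org/2005/sparql-results#'"
      + " xmlns='http://www.w3.org/1999/xhtml'"
      + " exclude-result-prefixes='sr'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/sr:sparql'>"
      + "<ul>"
      + "<xsl:for-each select='sr:results/sr:result/sr:binding/sr:uri'>"
      + "<li><a href='{.}'><xsl:value-of select='.'/></a></li>"
      + "</xsl:for-each>"
      + "</ul>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies STYLESHEET to a SPARQL XML result and returns the xHTML fragment.
    static String toXhtml(String sparqlXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(sparqlXml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up query result listing two related objects (hypothetical PIDs).
        String result =
            "<sparql xmlns='http://www.w3.org/2005/sparql-results#'>"
          + "<head><variable name='related'/></head>"
          + "<results>"
          + "<result><binding name='related'><uri>info:fedora/demo:review1</uri></binding></result>"
          + "<result><binding name='related'><uri>info:fedora/demo:author1</uri></binding></result>"
          + "</results>"
          + "</sparql>";
        System.out.println(toXhtml(result));
    }
}
```

A fuller version would rewrite each info:fedora/ URI into the URL of the corresponding object's dissemination so that the links are clickable.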
- Export the two query input datastreams, which are included in your
product Digital Objects. The exported files should be named <pid>.txt,
where <pid> is the PID of the respective digital object.
- Export your FOXML digital objects upon completion. (IMPORTANT!!! When you export your objects, make sure that your "export CONTEXT" is
"archive"!!!)
What you should turn in
You should submit, via CMS, a single zip file,
which should include:
- The OWL/XML file for your ontology.
- Your exported FOXML objects for your Fedora repository
- Your query input data streams produced in step 13 above.
What you will be graded on
You will be graded on:
- Ontology Design: completeness and validity.
- Fedora digital object design: correspondence to ontology, completeness,
ability to execute as specified
- Professionalism
You will NOT be graded on:
- Quantity of Amazon products represented in your repository.
- Aesthetics of output beyond simple professionalism.
Project 2 Resources
[1] Fedora home page -
http://www.fedora.info.
[2] Fedora installation and configuration guide -
http://www.fedora.info/download/2.2/userdocs/distribution/installation.html
(I have also put up an annotated version of the installation and configuration
guide
here)
[3] Fedora Tutorial - http://www.fedora.info/download/2.0/userdocs/tutorials/tutorial2.pdf
[4] Fedora Demo Documentation - http://www.fedora.info/download/2.2/userdocs/distribution/demos.html
[5] Amazon Web Services -
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=703&categoryID=19
(technical documentation),
http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=5
(code samples and other documentation)
[6] oXygen XML Editor -
http://www.oxygenxml.com
[7] Fedora Resource Index Search Service -
http://www.fedora.info/download/2.2/userdocs/server/webservices/risearch/index.html
[8] Fedora Digital Object Relationships -
http://www.fedora.info/download/2.2/userdocs/digitalobjects/introRelsExt.html
[9] Protege Ontology Editor -
http://protege.stanford.edu/
[10] Racer Reasoner - http://www.racer-systems.com/
[11] Protege Owl Tutorial -
http://www.co-ode.org/resources/tutorials/ProtegeOWLTutorial.pdf