CS 5150
Software Engineering
Fall 2010

Project Suggestion:
Harvester for Cornell Research



eCommons

eCommons@Cornell is a digital repository that is open to anyone affiliated with Cornell University. It is a place to capture, store, index, preserve and redistribute materials in digital formats for educational, scholarly, research or historical purposes (http://ecommons.library.cornell.edu/).

eCommons is built on the widely used open-source DSpace software. This project will develop software that will be used at Cornell and also incorporated into future releases of DSpace.

Harvester for Cornell Research (Phase 2)

Client

John M. Saylor, Cornell University Library
email: jms1@cornell.edu
phone: 607-255-4134

CS 5150 contacts

The following people are forming a team to undertake this project. Please contact them if you would like to join the team.

Tanvi Goel (tg238@cornell.edu)
Uday Babbar (ub25@cornell.edu)
Nandini Shetty (ns567@cornell.edu)

Harvesting from Web sites: Phase 2

Cornell faculty and researchers often post their journal articles and papers on their own or departmental servers but do not deposit them in eCommons, either because depositing is inconvenient or because they are uncomfortable about their intellectual property agreements with publishers. The goal of this project is to develop a system that overcomes these barriers by semi-automatically harvesting, cataloging, and collecting research publications that are already posted on departmental or individual web sites in the cornell.edu domain.

As a first phase, a CS 5150 project in spring 2009 developed a system to crawl departmental and local servers looking for PDF files. It has a pleasingly accurate heuristic for predicting whether a file is a research paper, and it builds a database to track which files have been collected previously.
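The tracking database described above could work along the following lines. This is only a minimal sketch, not the Phase 1 implementation: the table name `collected`, the functions `open_tracker` and `record_if_new`, the use of a SHA-256 content hash as the key, and the example URLs are all assumptions made for illustration.

```python
import hashlib
import sqlite3


def open_tracker(path=":memory:"):
    """Open (or create) a tracking database. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS collected ("
        " sha256 TEXT PRIMARY KEY,"  # hash of the PDF bytes
        " url TEXT NOT NULL)"        # where the crawler first saw the file
    )
    return conn


def record_if_new(conn, url, content):
    """Record a crawled PDF; return True if its content was not seen before."""
    digest = hashlib.sha256(content).hexdigest()
    cur = conn.execute(
        "INSERT OR IGNORE INTO collected (sha256, url) VALUES (?, ?)",
        (digest, url),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted, 0 when the hash already existed.
    return cur.rowcount == 1


if __name__ == "__main__":
    tracker = open_tracker()
    # Example URLs and bytes are made up for illustration.
    print(record_if_new(tracker, "http://example.edu/a.pdf", b"%PDF-1.4 paper"))
    print(record_if_new(tracker, "http://example.edu/copy.pdf", b"%PDF-1.4 paper"))
```

Keying on a hash of the file contents rather than on the URL means that the same paper posted on both a personal and a departmental server would be collected only once, which may or may not match the choice made in Phase 1.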

For Phase 2, the goals are:

  • Automatically generate metadata for each file.
  • Send email to authors requesting permission to archive their papers, either in the open access portion of eCommons or in a restricted portion if the rights do not allow open access.
  • Determine whether each paper has previously been submitted to or published in a journal or conference (by searching Google, or a database such as Web of Science, Compendex, etc.).
  • Determine the publisher's agreement for self-archiving, either by looking it up in the Sherpa/Romeo database (http://www.sherpa.ac.uk/romeo.php) or by searching for the journal's own statement of the author's rights.
  • Build a database of these author-publisher agreements for each journal or publisher.
    authors whose material is crawled.
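One possible shape for the database of author-publisher agreements is sketched below. It is an assumption-laden illustration rather than a design: the table `agreements`, its columns, the functions `save_agreement` and `target_portion`, and the journal and publisher names are all hypothetical, and the sketch does not query Sherpa/Romeo itself. It only stores a policy that has already been determined by some other means.

```python
import sqlite3


def open_agreements(path=":memory:"):
    """Open (or create) the agreements database. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS agreements ("
        " journal TEXT PRIMARY KEY,"
        " publisher TEXT NOT NULL,"
        " allows_open_access INTEGER NOT NULL)"  # 1 = open access permitted
    )
    return conn


def save_agreement(conn, journal, publisher, allows_open_access):
    """Store or update the self-archiving policy recorded for a journal."""
    conn.execute(
        "INSERT OR REPLACE INTO agreements"
        " (journal, publisher, allows_open_access) VALUES (?, ?, ?)",
        (journal, publisher, int(allows_open_access)),
    )
    conn.commit()


def target_portion(conn, journal):
    """Return 'open', 'restricted', or 'unknown' for a paper from a journal."""
    row = conn.execute(
        "SELECT allows_open_access FROM agreements WHERE journal = ?",
        (journal,),
    ).fetchone()
    if row is None:
        return "unknown"  # no agreement recorded yet; needs a manual look-up
    return "open" if row[0] else "restricted"


if __name__ == "__main__":
    db = open_agreements()
    # Journal and publisher names are made up for illustration.
    save_agreement(db, "Journal of Examples", "Example Press", True)
    save_agreement(db, "Closed Letters", "Example Press", False)
    print(target_portion(db, "Journal of Examples"))
    print(target_portion(db, "Closed Letters"))
    print(target_portion(db, "Never Seen Quarterly"))
```

Returning "unknown" for journals with no stored agreement keeps the system semi-automatic: those papers can be routed to a person instead of being archived on a guess.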



William Y. Arms
Last changed: August 2010