Web Lab Collaboration Server

The Web Lab Collaboration Server is a prototype of an integrated tool suite for web data analysis. It operates on top of our local web archive at Cornell, which currently contains five crawls from the Internet Archive and several customized collections of web pages. The Web Lab Collaboration Server includes modules for searching the archive (metadata search and limited content search), generating extraction rules (wrappers) through a visual interface, extracting structured data sets (tables) from page sets, and sharing these objects with other users.
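
As a rough sketch of the extraction idea, the following Java 1.5 fragment models a wrapper as a rule that maps each page of a page set to rows of a table. The names Wrapper and Extractor are illustrative only and are not taken from the actual code base.

    import java.util.ArrayList;
    import java.util.List;

    /** Illustrative model of an extraction rule (wrapper) that turns
     *  archived pages into rows of a structured table. */
    public interface Wrapper {
        /** Column names of the table this wrapper produces. */
        List<String> schema();

        /** Extracts zero or more rows from a single archived page. */
        List<List<String>> extract(String pageHtml);
    }

    /** Applying a wrapper to a whole page set yields a table. */
    class Extractor {
        static List<List<String>> extractAll(Wrapper w, Iterable<String> pages) {
            List<List<String>> table = new ArrayList<List<String>>();
            for (String page : pages) {
                table.addAll(w.extract(page));
            }
            return table;
        }
    }

In the real system, wrappers of this kind would be generated through the visual interface rather than written by hand, and the resulting tables shared with other users.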

Required: Extensive experience with Java 1.5, GUI programming, JDBC, JavaScript/DHTML.
Optional: Knowledge of AJAX, RPC, Google Web Toolkit, JSON, and AOP is useful, but can be acquired on the job.
Credits: Class project or M.Eng. project
Team members: Michal Kuklis, Wioletta Holownia, Daniela Balmus (Fall 2007 - Spring 2008).
Status: Ongoing

Web Lab Services

The Cornell Web Lab Project provides an infrastructure for analyzing a large archive of web data. As part of this infrastructure, we are currently developing the Web Lab Services, a library of general-purpose services that can be used by various applications, such as the Web Lab Collaboration Server (see above), the crawl tool (see below), and the Web Lab web site at weblab.infosci.cornell.edu. Part of the service functionality can be carried over from the Web Lab Collaboration Server; however, a substantial amount of code must be written from scratch, based on an existing detailed specification of the service interfaces and the underlying database schema.
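
To give a flavor of what a general-purpose, JDBC-backed service might look like, here is a minimal sketch in Java 1.5 (note the use of generics, which the project requires). The class LookupService is hypothetical; the real interfaces are fixed by the project's existing specification.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    /** Illustrative generic lookup service backed by a JDBC connection. */
    public abstract class LookupService<K, V> {
        private final Connection conn;

        protected LookupService(Connection conn) { this.conn = conn; }

        /** SQL text with a single placeholder for the key. */
        protected abstract String query();

        /** Maps one result row to a value object. */
        protected abstract V map(ResultSet rs) throws SQLException;

        public List<V> lookup(K key) throws SQLException {
            PreparedStatement ps = conn.prepareStatement(query());
            try {
                ps.setObject(1, key);
                ResultSet rs = ps.executeQuery();
                List<V> result = new ArrayList<V>();
                while (rs.next()) {
                    result.add(map(rs));
                }
                return result;
            } finally {
                ps.close();
            }
        }
    }

A concrete service would subclass this with its own SQL and row mapping, so that applications share one access pattern against the underlying database schema.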

Required: Extensive experience with Java 1.5 (especially generics), JDBC.
Optional: Knowledge of AJAX, RPC, and JSON is useful, but can be acquired on the job.
Credits: Class project or M.Eng. project
Team members: Michal Kuklis, Wioletta Holownia, Natasha Qureshi (Fall 2007 - Spring 2008).
Status: Ongoing
Co-supervisor: Manuel Calimlim

Crawl Tool

Cornell's web archive currently contains four complete web crawls from the Internet Archive, all created before 2006. For researchers who want to analyze more recent data or pages from specific sites, we have built a crawl tool on top of the Heritrix open-source crawler. The tool allows users to specify crawl parameters and run the crawler asynchronously. The crawled web pages are automatically added to our repository and are hence available to other tools, such as the Web Lab Collaboration Server (see above).
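
The sketch below illustrates the asynchronous submission pattern using java.util.concurrent from Java 5. CrawlParams, CrawlService, and runHeritrixJob are illustrative stand-ins, not the tool's actual API; the Heritrix-specific job configuration is elided.

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    /** Illustrative crawl parameters as specified by the user. */
    class CrawlParams {
        final String seedUrl;
        final int maxPages;
        CrawlParams(String seedUrl, int maxPages) {
            this.seedUrl = seedUrl;
            this.maxPages = maxPages;
        }
    }

    /** Illustrative asynchronous crawl submission service. */
    class CrawlService {
        private final ExecutorService pool = Executors.newFixedThreadPool(2);

        /** Submits a crawl and returns immediately; the Future can be
         *  polled for completion while the crawl runs in the background. */
        Future<Integer> submit(final CrawlParams p) {
            return pool.submit(new Callable<Integer>() {
                public Integer call() throws Exception {
                    // Placeholder for configuring and running a Heritrix
                    // job, then loading the results into the repository.
                    return runHeritrixJob(p);
                }
            });
        }

        /** Stub; the real tool drives Heritrix here. */
        int runHeritrixJob(CrawlParams p) { return 0; }
    }

Returning a Future lets a front end report crawl progress without blocking, which matches the tool's requirement that crawls run asynchronously.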

Required: Experience with Java 1.5, JDBC.
Optional: Knowledge of Heritrix is useful, but can be acquired on the job.
Credits: M.Eng. project
Team members: Madhav Puri (Fall 2007 - Spring 2008).
Status: Finished
Co-supervisor: Manuel Calimlim

Full-text Indexing

Cornell Web Lab users would like to combine metadata search with full-text search for specific keywords in the archive contents. We have used NutchWAX, an open-source web-search package, to create a full-text index of parts of our local web archive on a 32-node compute cluster at Cornell's Center for Advanced Computing.
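
Since NutchWAX builds Lucene-format indexes, a keyword lookup can be sketched with the Lucene API of that era, as below. The index path and the field names "content" and "url" are assumptions for illustration, not the project's actual configuration.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    /** Illustrative keyword lookup against a Lucene-format index,
     *  such as the one NutchWAX produces. */
    public class KeywordSearch {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/weblab/index");
            QueryParser parser =
                new QueryParser("content", new StandardAnalyzer());
            Query query = parser.parse("digital libraries");
            Hits hits = searcher.search(query);
            // Print the URLs of the top ten matching pages.
            for (int i = 0; i < Math.min(10, hits.length()); i++) {
                Document doc = hits.doc(i);
                System.out.println(doc.get("url"));
            }
            searcher.close();
        }
    }

Combining such keyword hits with the existing metadata search is exactly the kind of integration the follow-up work would address.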

Required: Experience with Java.
Optional: Knowledge of Nutch and NutchWAX is useful, but can be acquired on the job.
Credits: Class project or M.Eng. project
Team members: Kyeongseo Hwang, Jung Kwan Kim, Hardeep Singh (Spring 2007).
Status: Finished with preliminary results; to be continued.