Web Site Profiler

William Arms (wya@cs.cornell.edu)
on behalf of the Library of Congress

Project outline

This work is part of a larger project with the Library of Congress to collect and preserve open access information from the Web.  We are acting as advisors to the Library.  Part of the Library's strategy is to identify specific web sites for their cultural importance and to download them on a regular basis.  The initial list of sites includes the Gore and Bush campaign sites.  For further information see: http://www.cs.cornell.edu./wya/LC-web/.

The web site profiler is a tool that analyzes a web site and produces a profile of the characteristics that matter for maintaining, mirroring, and preserving it.  This proves to be a difficult problem.  For example, it is hard to define the boundaries of many web sites.  Last spring, two M.Eng. students began work on a web site profiler.  They built a simple crawler in Perl and began to study the user interface and associated problems.  We now understand the task well enough to develop a production-quality profiler.  Although the initial client is the Library of Congress, we expect that this tool will be used in many other applications.
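
To illustrate why site boundaries are hard to define, here is a minimal sketch in Python (not the Perl crawler mentioned above, and not the project's actual method).  It applies three plausible boundary rules to the same links and shows that they disagree.  The rule names, the function names, and the example.org URLs are all assumptions made for this illustration.

```python
from urllib.parse import urlsplit


def host(url):
    """Lower-cased host name, without any trailing dot."""
    return (urlsplit(url).hostname or "").lower().rstrip(".")


def same_host(seed, link):
    """Rule 1 (assumed): the site is exactly one host name."""
    return host(seed) == host(link)


def same_domain(seed, link):
    """Rule 2 (assumed): the site is every host sharing the last two labels.

    This is deliberately naive; it is wrong for suffixes such as co.uk.
    """
    def tail(h):
        return ".".join(h.split(".")[-2:])
    return tail(host(seed)) == tail(host(link))


def same_path_prefix(seed, link):
    """Rule 3 (assumed): the site is one directory subtree on one host."""
    seed_path = urlsplit(seed).path or "/"
    link_path = urlsplit(link).path or "/"
    if not seed_path.endswith("/"):
        # Treat the seed's last segment as a file; keep its directory.
        seed_path = seed_path.rsplit("/", 1)[0] + "/"
    return same_host(seed, link) and link_path.startswith(seed_path)


# Hypothetical policy table; a real profiler would need richer rules.
POLICIES = {
    "same_host": same_host,
    "same_domain": same_domain,
    "same_path_prefix": same_path_prefix,
}


def classify(seed, link):
    """Report, for each rule, whether the link falls inside the seed's site."""
    return {name: rule(seed, link) for name, rule in POLICIES.items()}


if __name__ == "__main__":
    seed = "http://www.example.org/campaign/"
    for link in (
        "http://www.example.org/campaign/news.html",
        "http://www.example.org/about.html",
        "http://store.example.org/campaign/shirts",
        "http://other.example.com/",
    ):
        print(link, classify(seed, link))
```

Under these assumed rules, a page elsewhere on the same host is inside the site by host name but outside it by path prefix, and a page on a sibling host is inside only under the domain rule.  No single rule is obviously right, which is the difficulty the profiler has to address.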

