CS 501 Software Engineering: Project Suggestion

CS 501
Software Engineering
Spring 2005

Project Suggestions: Legal Information Institute

Client

Tom Bruce, Director, Legal Information Institute, trb2@cornell.edu

The Legal Information Institute

Cornell Law School's Legal Information Institute (LII) is the most highly ranked source of legal information on the Web. It is also by far the most heavily used web site at Cornell. Two projects have been proposed.

Cluster Map

The Legal Information Institute receives nearly 10 million hits each week. We know comparatively little about our audience, and need to find out more. Like many web sites, we track only the IP addresses of those who access our services. We would like some way to turn information about IP addresses into substantive information about those who are using our services -- for example, whether or not they are lawyers, and if so, where they practice and what areas of law they practice in. We might simply run a series of "whois" lookups to find out what entities the logged IP addresses belong to, but the results would be tedious to survey.

What we propose is the construction of software that will accept a list of IP addresses of indefinite length and create from it a "cluster map" of a Web space whose content represents our users. The pages that make up the Web space might be chosen simply by taking the home pages of the domains from which our users come; one would then cluster these according to some measure of similarity and try to see what sort of affinity group or real-world grouping each cluster represents. The software would need to present the clusters as a two-dimensional map, with appropriate viewing capabilities, and allow interactive tweaking of parameters to produce the most legible views.
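As a rough illustration of "some measure of similarity", the sketch below compares pages by the Jaccard overlap of their word sets and groups them greedily. The choice of Jaccard similarity, the 0.3 threshold, and the function names (tokens, jaccard, cluster_pages) are all assumptions made for this example; the proposal does not prescribe a measure, and the two-dimensional layout is not addressed here.

```python
import re


def tokens(text):
    """Return the set of lowercased alphabetic words of 3+ letters."""
    return set(re.findall(r"[a-z]{3,}", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_pages(pages, threshold=0.3):
    """Greedy single-pass clustering (an illustrative choice only).

    pages: dict mapping host name -> page text.
    Each page joins the first cluster whose seed page is at least
    `threshold` similar to it; otherwise it seeds a new cluster.
    Returns a list of clusters, each a list of host names.
    """
    clusters = []  # list of (seed_tokens, [hosts])
    for host, text in pages.items():
        toks = tokens(text)
        for seed, members in clusters:
            if jaccard(seed, toks) >= threshold:
                members.append(host)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append((toks, [host]))
    return [members for _, members in clusters]
```

With two law-firm home pages and one university home page as input, the two law-firm hosts would be expected to land in one cluster and the university in another, provided their vocabularies overlap enough to clear the threshold.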

One method for doing this might be as follows:

1. Look up the domain name for an IP address from the log file (192.168.2.1 => 'websurfer.somedomain.com').
2. Find a web server belonging to that domain (e.g., 'www.somedomain.com').
3. Pull down its root page and throw it on a heap.
4. Repeat steps 1-3 many times.
5. Identify clusters of 'similar' root pages in the heap.
6. Map the clusters in the heap and present them as an interface that permits the user to read all the clustered pages.
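
The first four steps of the method above (lookup, web server guess, root page fetch, repetition) could be sketched as follows. This is a minimal sketch, not a prescribed design: the function names are hypothetical, the "www. plus the last two labels" rule is a naive assumption that fails for suffixes such as 'co.uk', and the resolver and fetcher are passed in as parameters so the loop can be exercised without network access.

```python
import socket
import urllib.request


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Step 1: map an IP address to a host name, or None if lookup fails."""
    try:
        return resolver(ip)[0]
    except (socket.herror, socket.gaierror):
        return None


def guess_web_host(hostname):
    """Step 2: guess the web server as 'www.' + the last two labels.

    Naive assumption for illustration; a real tool would need a list
    of public suffixes to handle names such as 'example.co.uk'.
    """
    labels = hostname.rstrip(".").split(".")
    if len(labels) < 2:
        return None
    return "www." + ".".join(labels[-2:])


def fetch_root_page(web_host, timeout=10):
    """Step 3: download the root page; None if the request raises OSError
    (which covers URLError, HTTPError, and socket timeouts)."""
    try:
        url = f"http://{web_host}/"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return None


def build_heap(ips, resolver=socket.gethostbyaddr, fetch=fetch_root_page):
    """Step 4: repeat steps 1-3 per address, trying each web host once.

    Returns a dict mapping web host -> root page text.
    """
    heap = {}
    seen = set()
    for ip in ips:
        hostname = reverse_lookup(ip, resolver)
        web_host = guess_web_host(hostname) if hostname else None
        if web_host is None or web_host in seen:
            continue
        seen.add(web_host)
        page = fetch(web_host)
        if page is not None:
            heap[web_host] = page
    return heap
```

Keeping the heap keyed by web host means that many log entries from the same domain contribute a single page, which keeps heavily represented domains from dominating the later clustering step.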

Geographic Reference

Geographic radius information is useful in a wide variety of activities, ranging from business web site construction to epidemiology to nonprofit fundraising. Queries like "find all records of type X within radius Y of this [address | phone number | zip code]" are used all the time in building things like store locators and trip planners. Yet there seems to be no generally available open-source library that implements radius-finding algorithms in a way that is easily integrated with other open-source applications, especially those running on LAMP (Linux, Apache, MySQL, Perl/PHP) platforms -- though some do seem to be commercially available. The project is to build reusable libraries in Perl, PHP, and Ruby that can perform the following task:

Given a zip code, phone number, GPS coordinates, latitude+longitude, or street address in the United States (without zip), and a database in which some table and field contain a zip code, street address, or telephone number, find all records that are within radius X of the given location. Radii can be expressed in miles as a first cut; further work would involve radii expressed in terms of other units of distance, or of time (e.g., "all records within an hour's drive of Ithaca, NY").
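
To make the core of the task concrete, here is a minimal sketch of a radius search once every input has been reduced to latitude and longitude. It is written in Python purely for illustration (the project itself calls for Perl, PHP, and Ruby), uses the standard haversine great-circle formula, and assumes a caller-supplied `locate` function that turns a record into coordinates; that function, like the names haversine_miles and records_within, is hypothetical. Geocoding zip codes, phone numbers, and street addresses is the harder part and is not shown.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def records_within(center, records, radius_miles, locate):
    """Return (distance, record) pairs within radius_miles of center.

    center: (lat, lon) in degrees.
    locate: function mapping a record to (lat, lon), or None when the
            record cannot be geocoded (such records are skipped).
    Results are sorted nearest first.
    """
    hits = []
    for rec in records:
        point = locate(rec)
        if point is None:
            continue
        dist = haversine_miles(center[0], center[1], point[0], point[1])
        if dist <= radius_miles:
            hits.append((dist, rec))
    hits.sort(key=lambda pair: pair[0])
    return hits
```

For a zip-code store locator, `locate` might look up each record's zip code in a table of zip-code centroids; the same `records_within` call would then serve GPS-driven applications by passing the receiver's coordinates as `center`.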

Performance, usability, documentation, and breadth of application are all issues here. The libraries need to be quick, lean, and efficient, and the API needs to be simple to understand, well-documented, and in general easy for other developers to use. They should work with as wide a range of inputs as possible, so as to make the libraries suitable for use in very diverse settings -- for example, a store locator based on zip code, applications driven by GPS receivers, and so on. Coding and documentation practices need to meet whatever standards are imposed by well-known professional distribution channels for open-source, reusable libraries -- e.g., CPAN for Perl, or RAA for Ruby.

William Y. Arms
(wya@cs.cornell.edu)
Last changed: January 21, 2005