CS 501
Software Engineering
Spring 2008

Project Suggestions:
The Legal Information Institute


Client

Tom Bruce, Director of the Legal Information Institute, trb2@cornell.edu.

The Legal Information Institute

Cornell Law School's Legal Information Institute (LII) is a pre-eminent publisher of open-access electronic legal information. It accounts for over 20 percent of Cornell's Web traffic, reaches users in more than 200 countries and territories, and receives more than a million page views each day. It is a leader in developing applications that work with legal information and make it more accessible to the public.

In previous years, there have been several successful CS 501 projects for the Legal Information Institute. This year, two projects have been proposed.

The Spaeth database of Supreme Court statistics (Spaeth 2.0)

The LII receives numerous questions about the voting records of Supreme Court justices. These come from a wide range of people, including high-school and college students writing papers and projects, political-science researchers, ordinary citizens, and journalists. The LII has done research projects for a Washington Post reporter writing a book on Clarence Thomas and for the staff of 60 Minutes.

The main source of answers for such questions is the Spaeth database (http://web.as.uky.edu/polisci/ulmerproject/sctdata.htm), a comprehensive database of Supreme Court statistics developed and maintained by political scientists. It is very difficult to understand and use, which is one reason that people come to the LII for answers rather than consulting the Spaeth database directly. Another reason is that the LII is easier to find and is widely recognized as a publisher of Supreme Court opinions.

The purpose of this high-visibility project is to make the information in the Spaeth database easy for an average person to use, and to capture collective wisdom about its contents. Part of the challenge is that its underlying data model is hard to understand, and part is that the database itself is very compactly (some would say cryptically) encoded, in the style of social scientists of perhaps 40 years ago.

This suggests two parts to the project. The first and easier part is to build software that can parse the existing Spaeth data and load it into a relational database, repeating the process regularly as the Spaeth database is updated. The second and more difficult part is to build a querying system that allows users to build, refine, store, and publish queries against the database. The aim is to create a system that captures queries for refinement and re-use by others -- a kind of Wikipedia of questions about the voting records of Supreme Court justices -- so that new users can benefit from others' efforts to formulate meaningful questions and build queries that answer them. In that respect, the project has interesting HCI aspects.

The underlying database should be MySQL, which is standard across LII operations. The parser might be coded in Perl or (preferably) Ruby; the query-system code should be in PHP.
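
As a rough illustration of the query-capture idea, the sketch below is written in PHP with PDO against MySQL, matching the stack suggested above. It assumes a hypothetical votes table produced by the loader and a hypothetical saved_queries table that stores parameterized questions so that later visitors can re-run and refine them; every table and column name here is a placeholder, not the actual Spaeth coding scheme.

<?php
// Minimal sketch only.  The tables (votes, saved_queries) and their
// columns (justice, term, direction, ...) are invented placeholders,
// not the real Spaeth coding scheme.

$db = new PDO('mysql:host=localhost;dbname=spaeth', 'user', 'password');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Store a reusable, human-readable question together with its SQL template,
// so that other users can find, re-run, and refine it later.
function save_query(PDO $db, $title, $description, $sql_template) {
    $stmt = $db->prepare(
        'INSERT INTO saved_queries (title, description, sql_template)
         VALUES (:title, :description, :sql_template)');
    $stmt->execute(array(
        ':title'        => $title,
        ':description'  => $description,
        ':sql_template' => $sql_template,
    ));
    return $db->lastInsertId();
}

// Re-run a saved query with user-supplied parameters (bound, never interpolated).
function run_saved_query(PDO $db, $query_id, array $params) {
    $stmt = $db->prepare('SELECT sql_template FROM saved_queries WHERE id = :id');
    $stmt->execute(array(':id' => $query_id));
    $run = $db->prepare($stmt->fetchColumn());
    $run->execute($params);
    return $run->fetchAll(PDO::FETCH_ASSOC);
}

// Example: how did one justice vote, by term and coded direction?
$id = save_query($db,
    'Votes by direction per term for one justice',
    'Counts a single justice\'s votes, grouped by term and coded direction.',
    'SELECT term, direction, COUNT(*) AS votes
       FROM votes
      WHERE justice = :justice
      GROUP BY term, direction
      ORDER BY term');

foreach (run_saved_query($db, $id, array(':justice' => 'Thomas')) as $row) {
    printf("%s  %s  %d\n", $row['term'], $row['direction'], $row['votes']);
}

The point of a saved_queries table is that a carefully formulated question, once built, becomes a shared artifact that others can discover and adapt, which is the "Wikipedia of questions" idea described above.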

Annotation and management of Federal rulemaking documents

Each year, the Federal government issues more than 8,000 regulations. They cover everything from the size of truck tires to the importation of apricots from Zambia. Most are constructed through a process known as "notice and comment" rulemaking, in which stakeholders and the general public are invited to comment on proposed regulations. This is a process of consultation through which the agency hopes to gather the best possible information for deciding what to regulate and how. The number of comments can be very large (over half a million in some cases) or very small (fewer than ten). Understaffed agencies find it difficult to handle large numbers of comments, so there is an incentive to use information technology to help manage the process. CERI (the Cornell e-Rulemaking Initiative, http://ceri.law.cornell.edu/) is a multidisciplinary project aimed at finding ways to do that. The approaches CERI has explored range from interface redesign to machine learning.

Last year, a CS 501 team developed the core of an application that acts as a workflow and comment-management tool for agencies formulating regulations and, at the same time, serves as a platform for researchers applying natural-language processing techniques to comment data. The task for this year is to extend that application in several significant ways, including better support for workflow functions, the addition of various interface and management tools to the core annotation functions, and (perhaps most importantly) integration of the database with natural-language processing engines. The system will potentially be placed in use with test groups at the Department of Transportation, the Department of Commerce, the FAA, the EPA, and the Bureau of Industry and Security.

The project will be built on Drupal and MySQL platforms and make heavy use of AJAX techniques. Programming is principally in JavaScript and PHP. This is an unusual opportunity for a CS 501 team to take on an application of great importance to such a diverse audience.
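
As a small illustration of the AJAX pattern involved, the sketch below is a standalone PHP endpoint, deliberately not tied to Drupal's API, that accepts an annotation on a rulemaking comment and returns the comment's current annotations as JSON for browser-side JavaScript to render. The annotations table and all of its columns are assumptions made for this sketch.

<?php
// Standalone sketch only; in the real application this logic would live
// inside Drupal.  The annotations table and its columns are invented here
// purely for illustration.

header('Content-Type: application/json');

$db = new PDO('mysql:host=localhost;dbname=ceri', 'user', 'password');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// POST: attach an annotation (a label plus free text) to a comment document.
if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    $stmt = $db->prepare(
        'INSERT INTO annotations (comment_id, author, label, body, created)
         VALUES (:comment_id, :author, :label, :body, NOW())');
    $stmt->execute(array(
        ':comment_id' => (int) $_POST['comment_id'],
        ':author'     => $_POST['author'],
        ':label'      => $_POST['label'],
        ':body'       => $_POST['body'],
    ));
}

// In every case, respond with the current annotations for the requested
// comment; the AJAX caller uses the JSON to refresh its display in place.
$comment_id = isset($_POST['comment_id'])
    ? (int) $_POST['comment_id']
    : (int) $_GET['comment_id'];

$stmt = $db->prepare(
    'SELECT id, author, label, body, created
       FROM annotations
      WHERE comment_id = :comment_id
      ORDER BY created');
$stmt->execute(array(':comment_id' => $comment_id));

echo json_encode($stmt->fetchAll(PDO::FETCH_ASSOC));

A front-end script would POST new annotations to such an endpoint and re-render from the JSON it returns; an NLP engine could write suggested labels into the same table, which is one possible point of integration with the natural-language processing work mentioned above.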




William Y. Arms
(wya@cs.cornell.edu)
Last changed: January 10, 2008