CS 5150: Software Engineering
Fall 2014

Project Suggestion:
Legal Information Institute, Indexing the Code of Federal Regulations


 

Indexing the Code of Federal Regulations

Client

Sara Frug (ssf6@cornell.edu) and Tom Bruce (tom.bruce@cornell.edu), Legal Information Institute

The Legal Information Institute (LII) is a well-known publisher of open-access legal information. It operates the single most active web site at Cornell, with approximately 150,000 unique visitors per day and over 100 million page views during the last calendar year. Since 1992, the LII has been a leader in the application of Internet-based technologies to legal data. We have worked with several successful CS 5150 project teams over the years.

The problem

The Code of Federal Regulations (CFR) is a sprawling (~250,000 page) compilation of federal regulations published in the Federal Register, the “daily newspaper” for all Federal agencies. People who are affected by regulations (everyone) need to know “what am I required to do?” It would be helpful if the answer to that question were something more manageable than 2,500 undifferentiated hits from a full-text search. The Office of the Federal Register has, in the past, been legally required to maintain a human-compiled, paper-based topical index to the CFR. That is an expensive undertaking, and the Office might be exempted from maintaining one in the future. We'd like to make it easier to maintain a usable, modern index to the CFR, because it provides tremendous information-organization and discovery value that is not easily supplanted by search.

The goal

Create an extensible application that assists an information specialist in maintaining a usable index to the Code of Federal Regulations.

Why bother if we have Google?

An electronic index supports browsing and faceted search. End-users frequently enter the CFR with a couple of keywords (e.g., “import permit” and “mushrooms”) and a problem (e.g., “how do I appeal the denial of my import permit for mushrooms?”). Because of the organization of the CFR, the information they require is usually not to be found in a single place. An index can:

  • show appropriate granularity: chunk of the corpus (section, part, chapter), level in the topical hierarchy (“agricultural products”, “import/export procedures”, “permits”), level of agency organization (USDA, APHIS)
  • represent topics hierarchically (“the economy”, “farming”, “food safety”) in a way that is easier for users to identify with and employ
  • show adjacencies (at a particular level in the hierarchy or intersection between subjects, what’s on the table? e.g., recordkeeping, permits, inspections), allowing naive users the ability to “look both ways on the bookshelf” to find things that are close to, but not pinpointed by, a full text search
  • make use of external metadata that aids in discovery and understanding:
    • regulations promulgated in a single rulemaking (Federal Register)
    • regulations belonging to an agency or program (Federal Register)
    • regulations authorized by a particular statutory provision (Parallel Table of Authorities)
    • regulations frequently violated at once (compliance reports from data.gov)
    • regulations associated with detailed topics from an external ontology (e.g., Agrovoc)
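One way to picture the index described in the list above is as a set of entries, each tagged with a granularity, a path in the topical hierarchy, and a path in the agency organization. The sketch below is only an illustration under our own assumptions: the class names, field names, and citation strings are hypothetical and are not part of any existing LII system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One index entry pointing at a chunk of the CFR (hypothetical schema)."""
    citation: str            # placeholder identifier, not a real CFR citation
    granularity: str         # "section", "part", or "chapter"
    topic_path: tuple        # position in the topical hierarchy
    agency_path: tuple = ()  # position in the agency organization


class FacetedIndex:
    """Minimal in-memory index supporting browsing and adjacency lookups."""

    def __init__(self):
        self._by_topic = defaultdict(list)

    def add(self, entry):
        # Register the entry under every prefix of its topic path,
        # so that browsing works at any level of the hierarchy.
        for depth in range(1, len(entry.topic_path) + 1):
            self._by_topic[entry.topic_path[:depth]].append(entry)

    def browse(self, topic_prefix, granularity=None):
        """Entries at or below a topic, optionally filtered by granularity."""
        entries = self._by_topic.get(tuple(topic_prefix), [])
        return [e for e in entries
                if granularity is None or e.granularity == granularity]

    def adjacent_topics(self, topic_path):
        """Sibling topics: same parent in the hierarchy, different last element."""
        topic_path = tuple(topic_path)
        parent = topic_path[:-1]
        siblings = {
            key[-1]
            for key in self._by_topic
            if len(key) == len(topic_path)
            and key[:-1] == parent
            and key != topic_path
        }
        return sorted(siblings)
```

With entries filed under “permits”, “inspections”, and “recordkeeping” beneath the same parent topic, `adjacent_topics` on the “permits” path returns the other two, which is the “look both ways on the bookshelf” behavior described above.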

Who will use this software?

Information specialists (assume JD/MLIS-trained) or domain experts (assume training in subject matter and information science). Users will have training in traditional metadata librarianship, domain expertise, and possibly experience working with ontologies. They will be comfortable using, or able to work with, spreadsheets, word processing software, PowerPoint, and free-tagging in Flickr. They will not be expected to be programmers.

The audience for the resulting product potentially numbers in the millions.

What are the challenges?

  • Provide a user-friendly interface for interpreting, refining, and confirming proposed index entries created by machine-learning software
  • Conform to the Thesaurus of Indexing Terms while allowing for better topical organization
  • Make it possible to add additional facets manually
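The first two challenges above suggest a review workflow in which machine-proposed entries are queued, checked against the Thesaurus of Indexing Terms, and then confirmed, refined, or rejected by the information specialist. The following is a rough sketch of the data behind such a workflow, not a design for the interface itself. All names here (`ProposedEntry`, `review_queue`, `conforms`, `review`) and the `score` field are our own assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PROPOSED = "proposed"
    CONFIRMED = "confirmed"
    REJECTED = "rejected"


@dataclass
class ProposedEntry:
    """An index entry suggested by the machine-learning step (hypothetical)."""
    citation: str   # placeholder identifier for a chunk of the CFR
    term: str       # proposed indexing term
    score: float    # model confidence; the scale is an assumption
    status: Status = Status.PROPOSED


def review_queue(proposals):
    """Entries still awaiting review, most confident suggestions first."""
    pending = [p for p in proposals if p.status is Status.PROPOSED]
    return sorted(pending, key=lambda p: p.score, reverse=True)


def conforms(term, thesaurus_terms):
    """True if the term matches a thesaurus term, ignoring case."""
    return term.casefold() in {t.casefold() for t in thesaurus_terms}


def review(entry, accept, replacement_term=None):
    """Record the specialist's decision, optionally refining the term first."""
    if replacement_term is not None:
        entry.term = replacement_term
    entry.status = Status.CONFIRMED if accept else Status.REJECTED
    return entry
```

An interface built on something like this could show each pending entry alongside the result of `conforms`, so that the specialist can see at a glance whether a proposed term is already in the thesaurus before accepting or refining it.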

Where will we get the data?

We will use Mallet for topic modeling (http://mallet.cs.umass.edu/). We have the following resources available in machine-readable formats that can be used for structure and training data:

  • CFR Federal Register
  • Thesaurus of Indexing Terms
  • Abridged CFR Index
  • Federal Register with indexing terms
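As an illustration of how topic-model output might feed the index, the sketch below reads topic keys in the tab-separated layout that Mallet's topic trainer writes (topic number, weight, then space-separated top words) and looks for thesaurus terms among each topic's words. We assume that layout here, so it should be checked against the Mallet version actually used. The sample data and the function names are made up for this example, and only single-word terms are matched.

```python
def parse_topic_keys(text):
    """Parse topic-keys text into {topic_id: {"weight": ..., "words": [...]}}.

    Assumes one topic per line: topic number, weight, and top words,
    separated by tabs, with the words separated by spaces.
    """
    topics = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        topic_id, weight, words = line.split("\t", 2)
        topics[int(topic_id)] = {"weight": float(weight), "words": words.split()}
    return topics


def match_thesaurus(topics, thesaurus_terms):
    """For each topic, list the single-word thesaurus terms among its top words."""
    matches = {}
    for topic_id, topic in topics.items():
        words = {w.casefold() for w in topic["words"]}
        matches[topic_id] = [t for t in thesaurus_terms if t.casefold() in words]
    return matches


# Made-up sample in the assumed layout; not real model output.
SAMPLE = "0\t0.5\timport permit mushrooms inspection\n1\t0.5\trecords retention audit\n"

topics = parse_topic_keys(SAMPLE)
candidates = match_thesaurus(topics, ["Permit", "Audit", "Tariffs"])
```

Matches found this way would be candidate index terms for a specialist to confirm, not finished index entries.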



William Y. Arms
August to December 2014
Please send corrections to wya@cs.cornell.edu