Client

Thomas Bruce, Director, Legal Information Institute, Cornell Law School
Email: <tom@liicornell.org>

Advisors

Craig Newton, Legal Information Institute, Cornell Law School
Email: <craig@liicornell.org>

Sara Frug, Legal Information Institute, Cornell Law School
Email: <sara.frug@cornell.edu>

Student contact

Anusha Chowdhury, <ac2633@cornell.edu>, is setting up a team for this project. If you are interested in joining the team, please contact her.

Who we are

The Legal Information Institute operates the single most active web site at Cornell, with approximately 165,000 unique visitors per day and over 140 million page views during the last calendar year. This is about two-thirds of all of Cornell's web traffic.

Since 1992, the LII has been a leader in the application of Internet-based technologies to legal data. It was the first legal website, and one of the first 30 websites in the world. We have worked with several successful CS 5150 project teams over the years.

Last year, the LII provided Federal statutes and regulations to an audience of more than 32 million people from 246 countries. Our work, and that of the students who collaborate with us, has high visibility at Cornell and within the legislative and executive branches of the Federal government.

The site is valued both for its technical excellence and for its objectivity. Its non-partisan, informed analyses are frequently quoted in publications ranging from the New York Times and Washington Post to the Picayune Item (Louisiana) and the Cherokee One Feather. A reporter for ProPublica, formerly with the New York Times and Washington Post, has referred to us as “a vital part of our nation’s civic infrastructure”. Our work has been featured on This American Life and on The Colbert Report.

Project summary

Each year, the Congressional Research Service (CRS) produces a document called "The Constitution of the United States, Analysis and Interpretation". Popularly known as the "Constitution Annotated", or "CONAN", it provides legal analysis and interpretation of the Constitution, and particularly of Constitutional case law as decided by the Supreme Court. It is a very highly regarded source of information about the fundamentals of the American system of government. Like the LII, it is prized for its objectivity. It is one of very few sources of information about the Constitution that are free of partisan bias, in an era where constitutional interpretation strongly influences legislation on health care, immigration, and free speech rights.

In 1996, the LII received a copy of CONAN, in XML, from the editors at CRS who were responsible for its preparation. We have been unable to obtain an XML version since then; the only regularly updated, publicly available version is a PDF published by the Government Publishing Office (GPO). Repeated requests for an XML edition -- from the LII, the Sunlight Foundation, and members of the Senate Judiciary Committee -- have gone unanswered for nearly a decade.

Why is it important to do this?

We know from experience that the public cares a great deal about this. During the middle 20 minutes of the first GOP Presidential primary debate of the 2016 election season, half a million people came to view the Fourteenth Amendment on our web site (that one amendment is essential to an understanding of current Federal policy on both healthcare and immigration). No doubt some up-to-date, non-partisan explanation would have been helpful.

There are technical reasons as well. For one, the PDF edition published by GPO is effectively unreadable on mobile devices. For another, CONAN has great value as data, because it associates very specific parts of the text of the Constitution with the court cases that interpret them.

We would like to create an XML version by extracting the text from the PDF edition published by the GPO, and from that XML version create a number of RDF repositories that model the important data contained in CONAN. Massaging that data into a structured format that supports both publication and feature extraction will require ingenuity. It will also provide insight into, and experience with, the techniques that are applied to legacy data before it can be used in machine-learning and NLP applications.
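To make the XML-to-RDF step concrete, here is a minimal sketch of how section-to-case associations might be read out of an XML file as subject/predicate/object triples. The element names (`section`, `cite`), the attributes, and the `conan:` and `ex:` prefixes are all hypothetical; they are not the schema of the 1996 CRS file or of any LII data model, and the sample text is invented for illustration. A real implementation would probably use an RDF library rather than plain tuples.

```python
import xml.etree.ElementTree as ET

# Invented sample. Element and attribute names are assumptions made
# for this sketch, not an actual CONAN or LII schema.
SAMPLE = """
<conan>
  <section id="amendment-14-s1" heading="Equal Protection">
    <para>First illustrative sentence. <cite case="347 U.S. 483"/></para>
    <para>Second illustrative sentence. <cite case="388 U.S. 1"/></para>
  </section>
</conan>
"""

def section_triples(xml_text):
    """Yield (subject, predicate, object) tuples for each section.

    One triple records the section heading; one more is produced for
    every case citation found anywhere inside the section.
    """
    root = ET.fromstring(xml_text.strip())
    for section in root.iter("section"):
        subject = "conan:" + section.get("id")
        yield (subject, "ex:heading", section.get("heading"))
        for cite in section.iter("cite"):
            yield (subject, "ex:interpretedBy", cite.get("case"))

triples = list(section_triples(SAMPLE))
for triple in triples:
    print(triple)
```

Keeping the triples as plain tuples makes the mapping easy to inspect before committing to a particular RDF vocabulary or store.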

Major features of such a project would include accurate extraction and inlining of footnotes, identification of "lines of cases" associated with different facets of the analysis, and (optionally) extraction and identification of print and other analytic resources identified in the text. We can provide some code libraries that would assist in the extraction of legal citations, and have fairly well-developed data models for both caselaw and for Constitutional concepts.
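As an illustration of what "inlining" footnotes could mean in practice, the sketch below replaces footnote markers in a paragraph with inline XML elements that carry the footnote text. It assumes that an earlier extraction step has already normalized markers to the form `[^N]` and gathered footnote bodies into a dictionary keyed by number; both conventions, the `inline_footnotes` function, and the `<note>` element are hypothetical, and the sample sentences are invented. Recovering markers and footnote bodies from the PDF itself is the harder problem and is not shown here.

```python
import re
from xml.sax.saxutils import escape

# Assumed marker convention: "[^12]" stands for a reference to footnote 12.
MARKER = re.compile(r"\[\^(\d+)\]")

def inline_footnotes(body, notes):
    """Return a <para> element with each [^N] marker replaced by a <note>.

    Markers with no matching entry in `notes` are left in the text so
    that extraction failures stay visible instead of being dropped.
    """
    parts = []
    pos = 0
    for match in MARKER.finditer(body):
        parts.append(escape(body[pos:match.start()]))
        number = match.group(1)
        text = notes.get(number)
        if text is None:
            parts.append(escape(match.group(0)))
        else:
            parts.append('<note n="%s">%s</note>' % (number, escape(text)))
        pos = match.end()
    parts.append(escape(body[pos:]))
    return "<para>" + "".join(parts) + "</para>"

# Invented sample input.
body = "The clause applies to the states.[^1] Later cases narrowed it.[^2]"
notes = {
    "1": "See 347 U.S. 483 (1954).",
    "2": "Compare 388 U.S. 1 (1967) & later cases.",
}
xml_fragment = inline_footnotes(body, notes)
print(xml_fragment)
```

Escaping both the body text and the footnote text keeps the output well-formed XML even when the source contains characters such as `&` or `<`.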