Database Colloquium Schedule: Spring 2010

The database colloquium is the weekly meeting of students and faculty interested in data management, data mining, or related topics at Cornell. The colloquium is typically a paper presentation of seminal or recent papers of general interest. While many of the speakers are from the Cornell community, the colloquium also invites outside speakers to talk about their research. The colloquium is held every Monday in from 12:15-1 pm in 5130 Upson Hall.

On those days in which the database colloquium does not have an outside speaker, the colloquium is replaced by a more informal database lunch. This is a short lunch lunch starting at noon followed by an informal paper discussion on a recent topic of interest.





Provenance for Database Transformations

Database transformations (queries, views, mappings) take apart, filter, and recombine source data in order to populate warehouses, materialize views, and provide inputs to analysis tools. As they do so, applications often need to track the relationship between parts and pieces of the sources and parts and pieces of the transformations' output. This relationship is what we call database provenance. This tutorial presents an approach to database provenance that relies on two observations. First, provenance is a kind of annotation, and we can develop a general approach to annotation propagation that also covers other applications, for example to uncertainty and access control. In fact, provenance turns out to be the most general kind of such annotation, in a precise and practically useful sense. Second, the propagation of annotation through a broad class of transformations relies on just two operations: one when annotations are jointly used and one when they are used alternatively. This leads to annotations forming a specific algebraic structure, a commutative semiring. The semiring approach works for annotating tuples, field values and attributes in standard relations, in nested relations (complex values), and for annotating nodes in (unordered) XML. It works for transformations expressed in the positive fragment of relational algebra, nested relational calculus, unordered XQuery, as well as for Datalog, GLAV schema mappings, and tgd constraints. Specific semirings correspond to earlier approaches to provenance, while others correspond to forms of uncertainty, trust, cost, and access control. This is joint work with J.N. Foster, T.J. Green, Z. Ives, and G. Karvounarakis, done in part within the frameworks of the Orchestra and pPOD projects.

Val Tannen, University of Pennsylvania 5130 Upson, 10:30-11:30am
Optimizing Linear Counting Queries Under Differential Privacy

Differential privacy is a rigorous privacy standard that protects against powerful adversaries, offers precise accuracy guarantees, and has been successfully applied to a range of data analysis tasks. When differential privacy is satisfied, participants in a dataset enjoy the compelling assurance that information released about the dataset is virtually indistinguishable whether or not their personal data is included. Differential privacy is achieved by introducing randomness into query answers. The original algorithm for achieving differential privacy, commonly called the Laplace mechanism, returns the true answer after the addition of random noise drawn from a Laplace distribution. If an analyst requires only the answer to a single query about the database, then the Laplace mechanism is known to be optimal. But the Laplace mechanism can be highly suboptimal when a set of correlated queries are submitted, and despite much recent work, optimal strategies for answering a collection of correlated queries are not known in general. In this talk I will review the basic principles of differential privacy and then describe the "matrix mechanism", a new algorithm for answering a workload of predicate counting queries. Given a workload, the mechanism first requests answers to a different set of queries, called a query strategy, which are answered using the standard Laplace mechanism. Noisy answers to the workload queries are then derived from the noisy answers to the strategy queries. When the strategy queries are chosen appropriately, this two stage process increases accuracy (with no cost in privacy) by answering the workload queries using a more complex, correlated noise distribution. I will show that two recently-proposed algorithms, which provide accurate answers for the set of all range queries, can be seen as instances of the matrix mechanism. I will then present results on optimally choosing the query strategy to minimize the error for any given workload. This talk is based on forthcoming work that will appear in PODS 2010 and VLDB 2010, and is joint with Chao Li, Michael Hay, Vibhor Rastogi, Andrew McGregor, and Dan Suciu.

Gerome Miklau, University of Massachusetts -- Amherst 5130 Upson, 1:30-2:30pm

Recent prior semesters: