Database Colloquium

The database colloquium is the weekly meeting of students and faculty interested in data management, data mining, or related topics at Cornell. The colloquium is typically a paper presentation of seminal or recent papers of general interest. While many of the speakers are from the Cornell community, the colloquium also invites outside speakers to talk about their research. The colloquium is held every Monday in from 12:15-1 pm in 5130 Upson Hall.

On those days in which the database colloquium does not have an outside speaker, the colloquium is replaced by a more informal database lunch. This is a short lunch lunch starting at noon followed by an informal paper discussion on a recent topic of interest.





GLADE, A framework for Large Scale Data Processing

In this talk I will describe ongoing work with Florin Rusu on GLADE, a framework that allows large scale data analysis. GLADE is build on top of DataPath, a very efficient database system designed and implemented in collaboration with Chris Jermaine and students/postdocs. Generalized linear aggregates (GLAs) are the main abstraction on GLADE and provide a counterpart to map-reduce type abstractions for specifying user defined complex processing of large data. I will argue in the talk that GLAs are a cleaner, more natural and flexible abstraction to express complex processing. When paired with a high-performance database system like DataPath, GLAs can lead to very efficient processing that rival the state of the art.

GLAs, as an abstraction, offer interesting opportunities that I will discuss as well. First, it liberates the user from any tuning tasks since the system has enough flexibility and knowledge to control the parallel/distributed computation. Second, since GLAs are theoretically clean, correctness of the abstract interface can be automatically verified. Such a mechanism should greatly help with ensuring correctness of the computation. Third, GLAs are composable and paramatrizable. What this immediately allows is the construction of GLA libraries that implement various useful types of processing. Most users can simply use existing GLAs as building blocks for more complex GLAs. Forth, GLAs put less restrictions on the distributed algorithm allowing more flexibility in implementation of fault-tolerant computation.

An ongoing theme will be comparing GLAs and map-reduce at many levels: theoretical, implementation, user experience, etc. This will include a discussion on situations when map-reduce is more appropriate and possible solutions to bring GLAs at the same level.

Bio: Alin Dobra is an Associate Professor at University of Florida. He graduated from Cornell University, Computer Science Department in 2003 under the supervision of Johannes Gehrke. Throughout his career, Alin's interests shifted from data-mining (subject of his Ph.D thesis) to approximate query processing (sketching, sampling, histograms) and more recently to high-performance database systems. The main theme in Alin's research is the pursuit of linear structure and the process of generalization as the main method to simplify analysis, in the case of theoretical work, or simplify development and design in the case of systems work.

Alin Dobra, University of Florida 5130 Upson, 12:15-1pm
No Free Lunch in Data Privacy

Legal requirements and increase in public awareness due to egregious breaches of individual privacy have made data privacy an important field of research. Recent research, culminating in the development of a powerful notion called differential privacy, have transformed this field from a black art into a rigorous mathematical discipline. An algorithm satisfying differential privacy guarantees that the distribution of its outputs changes very little with the addition or deletion of an individual’s record in the data. Differential privacy has been successfully used to publish noisy (yet accurate) answers from statistical databases, where individual records have little or no correlation with others in the table.

In this talk we critically analyze the privacy protections offered by differential privacy on correlated data. Such correlated data arise naturally in social interaction data or in tabular data released along with exact statistics. First, we will describe the trade-off between accuracy of performing ``social recommendations'', or recommendations that are solely based on a user's social network, and the (differential) privacy of sensitive links in the social graph. We show using real networks that good private social recommendations are feasible only for a small subset of the users in the social network or for a lenient setting of privacy parameters. Next, using examples of correlated data, we show that the use of differential privacy can lead to privacy breaches. We present a no-free-lunch theorem to argue that privacy tools like differential privacy, which do not make assumptions about how the data are generated, cannot simultaneously provide both privacy and utility. We propose a participation-based privacy model to overcome the weaknesses of differential privacy.

Bio: Ashwin Machanavajjhala is a Senior Research Scientist in the Knowledge Management group at Yahoo! Research. His primary research interests lie in the area of data management, with specific focus on information extraction/integration, probabilistic reasoning and privacy on the web. Ashwin graduated with a Ph.D. from the Department of Computer Science, Cornell University. His thesis work on defining and enforcing privacy was awarded the 2009 ACM SIGMOD Jim Gray Dissertation Award Honorable Mention. He has also received an M.S. from Cornell University and a B.Tech in Computer Science and Engineering from the Indian Institute of Technology, Madras.

Ashwin Machanavajjhala, Yahoo! Research 5130 Upson, 12:15-1pm

Prior semesters: