Data Privacy and Security @ Cornell


Privacy in Data Publishing. Tutorial at the 2009 IEEE Symposium on Security and Privacy. Part I, Part II.


The digitization of our daily lives has led to an explosion in the collection of personal data by governments, corporations, and individuals. Such information is stored in large databases, and easy access to these databases has resulted in a dramatic increase in the disclosure of private information about individuals. Analogously, many organizations need to guard their proprietary data against unauthorized access and administrators need to be able to express various data access control policies.

It is crucial to outfit today's data management infrastructure with methods that can limit the disclosure of information. In the Data Security and Privacy Group, we research various aspects of these problems including formal definitions of privacy and security, efficient algorithms for checking privacy and security, the trade-off between privacy and utility, and we apply our ideas to different settings like privacy preserving data mining, data publishing, and access control. More specifically, we are working on the following research directions:

Disclosure Control and Explainable Security

Companies and organizations collect and use vast troves of sensitive user data whose release must be carefully controlled. In practice, the access policies that govern this data are often fine-grained, complex, poorly documented, and difficult to reason about. As a result, principals frequently request and are granted access to data they never use. The problem is particularly salient in the context of app ecosystems, in which users store personal data on platforms such as Facebook or smartphones and APIs enable third-party applications (apps) to utilize that data.

In this project, we have developed a new formal model of data disclosure, which we present in our SIGMOD 2013 paper. In contrast with previous solutions, our model is data-derived and semantically meaningful. Information disclosure is modeled in terms of a set of distinguished security views. Each query is labeled with the precise set of security views that is needed to answer it, and these labels drive policy decisions.

To encourage developers and administrators to use security mechanisms more effectively, we have built on the above foundation to create a security model in which all security decisions are formally explainable. We introduce this model in our SIGMOD 2014 paper. Whether a query is accepted or denied, the system returns a concise yet formal explanation which can allow the issuer to reformulate a rejected query or adjust his/her security credentials. Our approach has a strong formal foundation based on previously unexplored connections between disclosure lattices and policy algebras. We build on this foundation and implement a disclosure control system that handles a wide variety of real SQL queries and can accommodate complex policy constraints. The system has been released as open source software and is available here.

Data Security and Privacy

Today's databases contain much sensitive personal information. Thus it is crucial to design database systems which can limit the disclosure of private information. To build such a system one has to answer two fundamental questions – (a) what is the sensitive information that is to be protected? and (b) what is the extent to which the database should limit the disclosure of the specified sensitive information?

For instance, consider a hospital which maintains patient records. The hospital wishes to disclose information to a pharmaceutical company in such a way that the pharmaceutical company cannot infer which patients have which diseases. One method to formally specify privacy policies is to express sensitive information as queries and enforce perfect privacy, a very strong notion of privacy that guarantees that any other query answered by the database will not disclose any information about the sensitive information. Perfect privacy has two attractive properties. First, usage of queries enables the policy writer to express complex privacy policies. Second, perfect privacy is collusion resistant in that adversaries cannot learn any sensitive information by colluding with each other. This collusion-resistance is crucial for efficient query answering when the database answers queries interactively instead of publishing views. When enforcing perfect privacy, the database does not need to keep track of all queries ever answered in order to guarantee that future query answers cannot be combined with old query answers to disclose information

In our research, we are working on practical algorithms for enforcing perfect privacy and other types of database privacy.

Privacy-preserving query-answering systems answer queries while provably guaranteeing that sensitive information is kept secret. One very attractive notion of privacy is perfect privacy — a secret is expressed through a query QS, and a query QV is answered only if it discloses no information about the secret query QS. However, if QS and Q V are arbitrary conjunctive queries, the problem of checking whether QV discloses any information about QS is known to be Πp2-complete. In this paper, we show that for large interesting subclasses of conjunctive queries enforcing perfect privacy is tractable. Instead of giving different arguments for query classes of varying complexity, we make a connection between perfect privacy and the problem of checking query containment. We then use this connection to relate the complexity of enforcing perfect privacy to the complexity of query containment.

Privacy Preserving Data Publishing

Personal records of individuals are increasingly being collected by various government and corporate institutions for the purposes of data analysis. To facilitate data analysis, these organizations publish "sufficiently private" views over this collected data. Privacy is a double edged sword -- there should be enough privacy to ensure that sensitive information about the individuals is not disclosed by the views and at the same time there should be enough data to perform the data analysis. Moreover, an adversary who wants to glean sensitive information from the published views usually has some background knowledge about the individuals in the data. In this project, we propose formal definitions of privacy and techniques for data publishing which guarantee privacy of sensitive information, even against adversaries who possess damaging background knowledge while ensuring that the data is useful for data analysis.

In this paper, we propose the first formal privacy analysis of a data anonymization process known as synthetic data generation, a technique becoming popular in the statistics community. The target application for this work is a mapping program that shows the commuting patterns of the population of the United States. The source data for this application were collected by the U.S. Census Bureau, but due to privacy constraints, they cannot be used directly by the mapping program. Instead, we generate synthetic data that statistically mimic the original data while providing privacy guarantees. We use these synthetic data as a surrogate for the original data. We find that while some existing definitions of privacy are inapplicable to our target application, others are too conservative and render the synthetic data useless since they guard against privacy breaches that are very unlikely. Moreover, the data in our target application is sparse, and none of the existing solutions are tailored to anonymize sparse data.
  • David Martin, Daniel Kifer, Ashwin Machanavajjhala, Johannes Gehrke, and Joseph Y. Halpern. Worst Case Background Knowledge for Privacy-Preserving Data Publishing. In Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE 2007), Istanbul, Turkey, April 2007
  • Recent work has shown the necessity of considering an attacker's background knowledge when reasoning about privacy in data publishing. However, in practice, the data publisher does not know what background knowledge the attacker possesses. Thus, it is important to consider the worst-case. In this paper, we initiate a formal study of worst-case background knowledge. We propose a language that can express any background knowledge about the data. We provide a polynomial time algorithm to measure the amount of disclosure of sensitive information in the worst case, given that the attacker has at most k pieces of information in this language. We also provide a method to efficiently sanitize the data so that the amount of disclosure in the worst case is less than a specified threshold.

    Limiting disclosure in data publishing requires a careful balance between privacy and utility. Information about individuals must not be revealed, but a dataset should still be useful for studying the characteristics of a population. Privacy requirements such as k-anonymity and l-diversity are designed to thwart attacks that attempt to identify individuals in the data and to discover their sensitive information. On the other hand, the utility of such data has not been well-studied.

    In this paper we will discuss the shortcomings of current heuristic approaches to measuring utility and we will introduce a formal approach to measuring utility. Armed with this utility metric, we will show how to inject additional information into k-anonymous and l-diverse tables. This information has an intuitive semantic meaning, it increases the utility beyond what is possible in the original k-anonymity and l-diversity frameworks, and it maintains the privacy guarantees of k-anonymity and -diversity.

    Publishing data about individuals without revealing sensitive information about them is an important problem. In recent years, a new definition of privacy called k-anonymity has gained popularity. In a k-anonymized dataset, each record is indistinguishable from at least (k−1) other records with respect to certain “identifying” attributes.

    In this paper we show with two simple attacks that a k-anonymized dataset has some subtle, but severe privacy problems. First, we show that an attacker can discover the values of sensitive attributes when there is little diversity in those sensitive attributes. Second, attackers often have background knowledge, and we show that k-anonymity does not guarantee privacy against attackers using background knowledge. We give a detailed analysis of these two attacks and we propose a novel and powerful privacy definition called ℓ-diversity. In addition to building a formal foundation for -diversity, we show in an experimental evaluation that -diversity is practical and can be implemented efficiently.

    Privacy Preserving Data Mining






    This research has been supported by the National Science Foundation under Grant CNS-0627680 and IIS-0121175, and by gifts from Microsoft Research and Yahoo! This research has also been supported by a Google Research Award, and we are very thankful to Google for their support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.