Data Privacy and Security @ Cornell
- A Recent Tutorial
- Code (Note: Release of anonymization toolkit demoed at SIGMOD 2009)
A Recent Tutorial
The digitization of our daily lives has led to an explosion in the collection of personal data by governments, corporations, and individuals. Such information is stored in large databases, and easy access to these databases has resulted in a dramatic increase in the disclosure of private information about individuals. Analogously, many organizations need to guard their proprietary data against unauthorized access and administrators need to be able to express various data access control policies.
It is crucial to outfit today's data management infrastructure with methods that can limit the disclosure of information. In the Data Security and Privacy Group, we research various aspects of these problems including formal definitions of privacy and security, efficient algorithms for checking privacy and security, the trade-off between privacy and utility, and we apply our ideas to different settings like privacy preserving data mining, data publishing, and access control. More specifically, we are working on the following three research directions:
Today's databases contain much sensitive personal information. Thus it is crucial to design database systems which can limit the disclosure of private information. To build such a system one has to answer two fundamental questions – (a) what is the sensitive information that is to be protected? and (b) what is the extent to which the database should limit the disclosure of the specified sensitive information?
For instance, consider a hospital which maintains patient records. The hospital wishes to disclose information to a pharmaceutical company in such a way that the pharmaceutical company cannot infer which patients have which diseases. One method to formally specify privacy policies is to express sensitive information as queries and enforce perfect privacy, a very strong notion of privacy that guarantees that any other query answered by the database will not disclose any information about the sensitive information. Perfect privacy has two attractive properties. First, usage of queries enables the policy writer to express complex privacy policies. Second, perfect privacy is collusion resistant in that adversaries cannot learn any sensitive information by colluding with each other. This collusion-resistance is crucial for efficient query answering when the database answers queries interactively instead of publishing views. When enforcing perfect privacy, the database does not need to keep track of all queries ever answered in order to guarantee that future query answers cannot be combined with old query answers to disclose information
In our research, we are working on practical algorithms for enforcing perfect privacy and other types of database privacy.
- Ashwin Machanavajjhala and Johannes Gehrke. On the Efficiency of Checking Perfect Privacy. In Proceedings of the 25th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2006).
Privacy-preserving query-answering systems answer queries while provably guaranteeing that sensitive information is kept secret. One very attractive notion of privacy is perfect privacy — a secret is expressed through a query QS, and a query QV is answered only if it discloses no information about the secret query QS. However, if QS and Q V are arbitrary conjunctive queries, the problem of checking whether QV discloses any information about QS is known to be Πp2-complete. In this paper, we show that for large interesting subclasses of conjunctive queries enforcing perfect privacy is tractable. Instead of giving different arguments for query classes of varying complexity, we make a connection between perfect privacy and the problem of checking query containment. We then use this connection to relate the complexity of enforcing perfect privacy to the complexity of query containment.
Personal records of individuals are increasingly being collected by various government and corporate institutions for the purposes of data analysis. To facilitate data analysis, these organizations publish "sufficiently private" views over this collected data. Privacy is a double edged sword -- there should be enough privacy to ensure that sensitive information about the individuals is not disclosed by the views and at the same time there should be enough data to perform the data analysis. Moreover, an adversary who wants to glean sensitive information from the published views usually has some background knowledge about the individuals in the data. In this project, we propose formal definitions of privacy and techniques for data publishing which guarantee privacy of sensitive information, even against adversaries who possess damaging background knowledge while ensuring that the data is useful for data analysis.
- Ashwin Machanavajjhala, Daniel Kifer, John Abowd, Johannes Gehrke, and Lars Vilhuber. Privacy: From Theory to Practice on the Map. In Proceedings of the 24th IEEE International Conference on Data Engineering (ICDE 2008), Cancun, Mexico, April 2008.
In this paper, we propose the first formal privacy analysis of a data anonymization process known as synthetic data generation, a technique becoming popular in the statistics community. The target application for this work is a mapping program that shows the commuting patterns of the population of the United States. The source data for this application were collected by the U.S. Census Bureau, but due to privacy constraints, they cannot be used directly by the mapping program. Instead, we generate synthetic data that statistically mimic the original data while providing privacy guarantees. We use these synthetic data as a surrogate for the original data. We find that while some existing definitions of privacy are inapplicable to our target application, others are too conservative and render the synthetic data useless since they guard against privacy breaches that are very unlikely. Moreover, the data in our target application is sparse, and none of the existing solutions are tailored to anonymize sparse data.
Recent work has shown the necessity of considering an attacker's background knowledge when reasoning about privacy in data publishing. However, in practice, the data publisher does not know what background knowledge the attacker possesses. Thus, it is important to consider the worst-case. In this paper, we initiate a formal study of worst-case background knowledge. We propose a language that can express any background knowledge about the data. We provide a polynomial time algorithm to measure the amount of disclosure of sensitive information in the worst case, given that the attacker has at most k pieces of information in this language. We also provide a method to efficiently sanitize the data so that the amount of disclosure in the worst case is less than a specified threshold.
- Daniel Kifer and J. E. Gehrke. Injecting Utility into Anonymized Datasets . In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (SIGMOD 2006).
Limiting disclosure in data publishing requires a careful balance between privacy and utility. Information about individuals must not be revealed, but a dataset should still be useful for studying the characteristics of a population. Privacy requirements such as k-anonymity and l-diversity are designed to thwart attacks that attempt to identify individuals in the data and to discover their sensitive information. On the other hand, the utility of such data has not been well-studied.
In this paper we will discuss the shortcomings of current heuristic approaches to measuring utility and we will introduce a formal approach to measuring utility. Armed with this utility metric, we will show how to inject additional information into k-anonymous and l-diverse tables. This information has an intuitive semantic meaning, it increases the utility beyond what is possible in the original k-anonymity and l-diversity frameworks, and it maintains the privacy guarantees of k-anonymity and -diversity.
- Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, and Muthu Venkitasubramaniam. l-Diversity: Privacy Beyond k-Anonymity. In Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE 2006), Atlanta, Georgia, April 2006
Publishing data about individuals without revealing sensitive information about them is an important problem. In recent years, a new definition of privacy called k-anonymity has gained popularity. In a k-anonymized dataset, each record is indistinguishable from at least (k−1) other records with respect to certain “identifying” attributes.
In this paper we show with two simple attacks that a k-anonymized dataset has some subtle, but severe privacy problems. First, we show that an attacker can discover the values of sensitive attributes when there is little diversity in those sensitive attributes. Second, attackers often have background knowledge, and we show that k-anonymity does not guarantee privacy against attackers using background knowledge. We give a detailed analysis of these two attacks and we propose a novel and powerful privacy definition called ℓ-diversity. In addition to building a formal foundation for ℓ-diversity, we show in an experimental evaluation that ℓ-diversity is practical and can be implemented efficiently.
Alexandre Evfimievski, Johannes Gehrke, and Ramakrishnan Srikant.
Limiting Privacy Breaches in Privacy Preserving Data Mining.
In Proceedings of the 22nd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2006), San Diego, CA, June 2003
There has been increasing interest in the problem of building accurate data mining models over aggregate data, while protecting privacy at the level of individual records. One approach for this problem is to randomize the values in individual records, and only disclose the randomized values. The model is then built over the randomized data, after first compensating for the randomization (at the aggregate level). This approach is potentially vulnerable to privacy breaches: based on the distribution of the data, one may be able to learn with high confidence that some of the randomized records satisfy a specified property, even though privacy is preserved on average.
In this paper, we present a new formulation of privacy breaches, together with a methodology, “amplification”, for limiting them. Unlike earlier approaches, amplification makes it is possible to guarantee limits on privacy breaches without any knowledge of the distribution of the original data. We instantiate this methodology for the problem of mining association rules, and modify the algorithm from  to limit privacy breaches without knowledge of the data distribution. Next, we address the problem that the amount of randomization required to avoid privacy breaches (when mining association rules) results in very long transactions. By using pseudorandom generators and carefully choosing seeds such that the desired items from the original transaction are present in the randomized transaction, we can send just the seed instead of the transaction, resulting in a dramatic drop in communication and storage cost. Finally, we define new information measures that take privacy breaches into account when quantifying the amount of privacy preserved by randomization.
Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, and Johannes Gehrke.
Privacy Preserving Mining of Association Rules.
In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2002), Edmonton, Alberta, Canada, July 2002
We present a framework for mining association rules from transactions consisting of categorical items where the data has been randomized to preserve privacy of individual transactions. While it is feasible to recover association rules and preserve privacy using a straightforward "uniform" randomization, the discovered rules can unfortunately be exploited to find privacy breaches. We analyze the nature of privacy breaches and propose a class of randomization operators that are much more effective than uniform randomization in limiting the breaches. We derive formulae for an unbiased support estimator and its variance, which allow us to recover itemset supports from randomized datasets, and show how to incorporate these formulae into mining algorithms. Finally, we present experimental results that validate the algorithm by applying it on real datasets.