- iReduct: Differential Privacy with Reduced Relative Errors
Xiaokui Xiao, Gabriel Bender, Michael Hay, Johannes Gehrke
SIGMOD 2011
pdf
X
Abstract
Prior work in differential privacy has produced techniques for
answering aggregate
numerical
queries over sensitive data in a privacy-preserving way. These
techniques achieve privacy by adding noise to the query answers.
Their objective is typically to minimize absolute errors while
satisfying
differential privacy. Thus,
query answers are injected with noise whose scale is
independent of whether the answers are large or small. The noisy
results for queries whose true answers are small therefore tend to
be dominated by noise, which leads to inferior data utility.
This paper introduces iReduct, a differentially private
algorithm for computing
answers with reduced relative errors. The basic idea of
iReduct is to inject different amounts of noise to different
query results, so that smaller (larger) values are more likely to be
injected with less (more) noise. The algorithm is based on a novel
resampling technique that employs correlated noise to improve
data utility. Performance is evaluated on an instantiation of
iReduct that generates marginals, i.e., projections of
multi-dimensional histograms onto subsets of their attributes.
Experiments on real data demonstrate the effectiveness of our
solution.
- Enabling Accurate Analysis of Private Network Data (awarded ACM SIGKDD Doctoral Dissertation Award)
Michael Hay
Ph.D. Thesis, 2010
pdf
- Resisting Structural Re-identification in Anonymized Social Networks
Michael Hay, Gerome Miklau, David Jensen, Don Towsley, and Chao Li
VLDB Journal 2010
pdf X
Abstract We identify privacy risks associated with releasing network datasets and provide an algorithm that mitigates those risks. A network dataset is a graph representing entities connected by edges representing relations such as friendship, communication, or shared activity. Maintaining privacy when publishing a network dataset is uniquely challenging because an individual's network context can be used to identify them even if other identifying information is removed.
In this paper, we introduce a parameterized model of structural knowledge available to the adversary and quantify the success of attacks on individuals in anonymized networks. We show that the risks of these attacks vary based on network structure and size, and provide theoretical results that explain the anonymity risk in random networks. We then propose a novel approach to anonymizing network data that models aggregate network structure and allows analysis to be performed by sampling from the model. The approach guarantees anonymity for entities in the network while allowing accurate estimates of a variety of network measures with relatively little bias.
- Optimizing Linear Counting Queries Under Differential Privacy
Chao Li, Michael Hay, Vibhor Rastogi, Gerome Miklau, Andrew McGregor
PODS 2010
pdf
X
Abstract
Differential privacy is a robust privacy standard that has been successfully applied to a range of data analysis tasks. But despite much recent work, optimal strategies for answering a collection of related queries are not known.
We propose the matrix mechanism, a new algorithm for answering a workload of predicate counting queries. Given a workload, the mechanism requests answers to a different set of queries, called a query strategy, which are answered using the standard Laplace mechanism. Noisy answers to the workload queries are then derived from the noisy answers to the strategy queries. This two stage process can result in a more complex correlated noise distribution that preserves differential privacy but increases accuracy.
We provide a formal analysis of the error of query answers produced by the mechanism and investigate the problem of computing the optimal query strategy in support of a given workload. We show this problem can be formulated as a rank-constrained semidefinite program. Finally, we analyze two seemingly distinct techniques, whose similar behavior is explained by viewing them as instances of the matrix mechanism.
- Boosting the Accuracy of Differentially-Private Histograms Through Consistency
Michael Hay, Vibhor Rastogi, Gerome Miklau, Dan Suciu
VLDB 2010
arXiv
X
X
Abstract We show that it is possible to significantly improve the accuracy of a general class of histogram queries while satisfying differential privacy. Our approach carefully chooses a set of queries to evaluate, and then exploits consistency constraints that should hold over the noisy output. In a post-processing phase, we compute the consistent input most likely to have produced the noisy output. The final output is differentially-private and consistent, but in addition, it is often much more accurate. On real datasets we show these techniques can be used for estimating the degree sequence of a graph very precisely, and for computing a histogram that can support arbitrary range queries accurately.
- Accurate Estimation of the Degree Distribution of Private Networks (awarded Best Student Paper)
Michael Hay, Chao Li, Gerome Miklau, David Jensen
ICDM 2009
pdf
X
X
Abstract
We describe an efficient algorithm for releasing a provably private estimate of the degree distribution of a network. The algorithm satisfies a rigorous property of differential privacy, and is also extremely efficient, running on networks of 100 million nodes in a few seconds.
Theoretical analysis shows that the error scales linearly with the number of unique degrees, whereas the error of conventional techniques scales linearly with the number of nodes. We complement the theoretical analysis with a thorough empirical analysis on real and synthetic graphs, showing that the algorithm's variance and bias is low, that the error diminishes as the size of the input graph increases, and that common analyses like fitting a power-law can be carried out very accurately.
- Enabling Accurate Analysis of Private Network Data
Michael Hay, Gerome Miklau, David Jensen
Draft book chapter, Privacy-Aware Knowledge Discovery: Novel Applications and New Techniques, Chapman & Hall/CRC Press.
pdf
X
Abstract
This is a draft copy of a chapter in the forthcoming book
-
Relationship Privacy: Output Perturbation for Queries with Joins
Vibhor Rastogi, Michael Hay, Gerome Miklau, Dan Suciu
PODS 2009
pdf
full
X
Abstract We study privacy-preserving query answering over data containing relationships. A social network is a prime example of such data, where the nodes represent individuals and edges represent relationships. Nearly all interesting queries over social networks involve joins, and for such queries, existing output perturbation algorithms severely distort query answers. We propose an algorithm that significantly improves utility over competing techniques, typically reducing the error bound from polynomial in the number of nodes to poly-logarithmic. The algorithm is, to the best of our knowledge, the first to answer such queries with acceptable accuracy, even for worst-case inputs.
The improved utility is achieved by relaxing the privacy condition. Instead of ensuring strict differential privacy, we guarantee a weaker (but still quite practical) condition based on adversarial privacy. To explain precisely the nature of our relaxation in privacy, we provide a new result that characterizes the relationship between ǫ-indistinguishability (a variant of the differential privacy definition) and adversarial privacy, which is of independent interest: an algorithm is \epsilon-indistinguishable iff it is private for a particular class of adversaries (defined precisely herein). Our perturbation algorithm guarantees privacy against adversaries in this class whose prior distribution is numerically bounded.
- Resisting Structural Re-identification in Anonymized Social Networks
Michael Hay, Gerome Miklau, David Jensen, Don Towsley, and Philipp Weis
VLDB 2008
pdf X
Abstract We identify privacy risks associated with releasing network data sets and provide an algorithm that mitigates those risks. A network consists of entities connected by links representing relations such as friendship, communication, or shared activity. Maintaining privacy when publishing networked data is uniquely challenging because an individual's network context can be used to identify them even if other identifying information is removed. In this paper, we quantify the privacy risks associated with three classes of attacks on the privacy of individuals in networks, based on the knowledge used by the adversary. We show that the risks of these attacks vary greatly based on network structure and size. We propose a novel approach to anonymizing network data that models aggregate network structure and then allows samples to be drawn from that model. The approach guarantees anonymity for network entities while preserving the ability to estimate a wide variety of network measures with relatively little bias.
- Anonymizing social networks
Michael Hay, Gerome Miklau, David Jensen, Philipp Weis, and Siddharth Srivastava
University of Massachusetts Amherst Technical Report 2007
pdf X
Abstract Advances in technology have made it possible to collect data about individuals and the connections between them, such as email correspondence and friendships. Agencies and researchers who have collected such social network data often have a compelling interest in allowing others to analyze the data. However, in many cases the data describes relationships that are private (e.g., email correspondence) and sharing the data in full can result in unacceptable disclosures. In this paper, we present a framework for assessing the privacy risk of sharing anonymized network data. This includes a model of adversary knowledge, for which we consider several variants and make connections to known graph theoretical results. On several real-world social networks, we show that simple anonymization techniques are inadequate, resulting in substantial breaches of privacy for even modestly informed adversaries. We propose a novel anonymization technique based on perturbing the network and demonstrate empirically that it leads to substantial reduction of the privacy threat. We also analyze the effect that anonymizing the network has on the utility of the data for social network analysis.
-
An integrated, conditional model of information extraction and coreference with application to citation matching
Ben Wellner, Andrew McCallum, Fuchun Peng and Michael Hay
Conference on Uncertainty in Artificial Intelligence (UAI) 2004
pdf X
Abstract Although information extraction and coreference resolution appear together in many applications, most current systems perform them as independent steps. This paper describes an approach to integrated inference for extraction and coreference based on conditionally-trained undirected graphical models. We discuss the advantages of conditional probability training, and of a coreference model structure based on graph partitioning. On a data set of research paper citations, we show significant reduction in error by using extraction uncertainty to improve coreference citation matching accuracy, and using coreference to improve the accuracy of the extracted fields.
-
Exploiting relational structure to understand publication patterns in high-energy physics
Amy McGovern, Lisa Friedland, Michael Hay, Brian Gallagher, Andrew Fast, Jennifer Neville, and David Jensen
SIGKDD Explorations 2003 (Winning entry in the KDD Cup)
pdf X
Abstract We analyze publication patterns in theoretical high-energy physics using a relational learning approach. We focus our analyses on four related areas: understanding and identifying patterns of citations, examining publication patterns at the author level, predicting whether a paper will be accepted by specific journals, and identifying research communities from the citation patterns and paper text. Each of these analyses contributes to an overall understanding of theoretical high-energy physics that could not have been achieved without examining each area in detail.
-
Learning relational probability trees
Jennifer Neville, David Jensen, Lisa Friedland, and Michael Hay
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) 2003
pdf X
Abstract Classification trees are widely used in the machine learning and data mining communities for modeling propositional data. Recent work has extended this basic paradigm to probability estimation trees. Traditional tree learning algorithms assume that instances in the training data are homogeneous and independently distributed. Relational probability trees (RPTs) extend standard probability estimation trees to a relational setting in which data instances are heterogeneous and interdependent. Our algorithm for learning the structure and parameters of an RPT searches over a space of relational features that use aggregation functions (e.g. AVERAGE, MODE, COUNT) to dynamically propositionalize relational data and create binary splits within the RPT. Previous work has identified a number of statistical biases due to characteristics of relational data such as auto-correlation and degree disparity. The RPT algorithm uses a novel form of randomization test to adjust for these biases. On a variety of relational learning tasks, RPTs built using randomization tests are significantly smaller than other models and achieve equivalent, or better, performance.
-
Avoiding bias when aggregating relational data with degree disparity
David Jensen, Jennifer Neville, and Michael Hay
International Conference on Machine Learning (ICML) 2003
pdf X
Abstract A common characteristic of relational data sets -- degree disparity -- can lead relational learning algorithms to discover misleading correlations. Degree disparity occurs when the frequency of a relation is correlated with the values of the target
variable. In such cases, aggregation functions used by many relational learning algorithms will result in misleading correlations and added complexity in models. We examine this problem through a combination of simulations and experiments. We show how two novel hypothesis testing procedures can adjust for the effects of using aggregation functions in the presence of degree disparity.